Chapter 1

Machine learning 


For the journal, see Machine Learning (journal). 

Machine learning is a subfield of computer science[1]
that evolved from the study of pattern recognition and
computational learning theory in artificial intelligence.[1]
Machine learning explores the construction and study of
algorithms that can learn from and make predictions on
data.[2] Such algorithms operate by building a model from
example inputs in order to make data-driven predictions
or decisions,[3]:2 rather than following strictly static
program instructions.

Machine learning is closely related to and often overlaps
with computational statistics, a discipline that also
specializes in prediction-making. It has strong ties to
mathematical optimization, which delivers methods, theory
and application domains to the field. Machine learning
is employed in a range of computing tasks where designing
and programming explicit algorithms is infeasible.
Example applications include spam filtering, optical
character recognition (OCR),[4] search engines and
computer vision. Machine learning is sometimes conflated
with data mining,[5] although the latter focuses more on
exploratory data analysis.[6] Machine learning and pattern
recognition “can be viewed as two facets of the same
field.”[3]

When employed in industrial contexts, machine learning
methods may be referred to as predictive analytics or
predictive modelling.

1.1 Overview 

In 1959, Arthur Samuel defined machine learning as a
“field of study that gives computers the ability to learn
without being explicitly programmed”.[7]

Tom M. Mitchell provided a widely quoted, more formal
definition: “A computer program is said to learn
from experience E with respect to some class of tasks T
and performance measure P, if its performance at tasks
in T, as measured by P, improves with experience E”.[8]
This definition is notable for defining machine learning
in fundamentally operational rather than cognitive
terms, thus following Alan Turing's proposal in his paper
“Computing Machinery and Intelligence” that the question
“Can machines think?” be replaced with the question
“Can machines do what we (as thinking entities) can
do?”[9]

1.1.1 Types of problems and tasks 

Machine learning tasks are typically classified into three
broad categories, depending on the nature of the learning
“signal” or “feedback” available to a learning system.
These are:[10]

• Supervised learning: The computer is presented 
with example inputs and their desired outputs, given 
by a “teacher”, and the goal is to learn a general rule 
that maps inputs to outputs. 

• Unsupervised learning: No labels are given to the
learning algorithm, leaving it on its own to find
structure in its input. Unsupervised learning can be a goal
in itself (discovering hidden patterns in data) or a
means towards an end.

• Reinforcement learning: A computer program interacts
with a dynamic environment in which it must
achieve a certain goal (such as driving a vehicle),
without a teacher explicitly telling it whether it has
come close to its goal or not. Another example
is learning to play a game by playing against an
opponent.[3]:3

Between supervised and unsupervised learning is
semi-supervised learning, where the teacher gives an
incomplete training signal: a training set with some (often
many) of the target outputs missing. Transduction is a
special case of this principle where the entire set of
problem instances is known at learning time, except that
some of the targets are missing.

Among other categories of machine learning problems,
learning to learn learns its own inductive bias based on
previous experience. Developmental learning, elaborated
for robot learning, generates its own sequences (also
called curriculum) of learning situations to cumulatively
acquire repertoires of novel skills through autonomous
self-exploration and social interaction with human
teachers, and using guidance mechanisms such as active
learning, maturation, motor synergies, and imitation.

Figure: A support vector machine is a classifier that divides its
input space into two regions, separated by a linear boundary. Here,
it has learned to distinguish black and white circles.

Another categorization of machine learning tasks arises
when one considers the desired output of a
machine-learned system:[3]:3

• In classification, inputs are divided into two or more
classes, and the learner must produce a model that
assigns unseen inputs to one or more (multi-label
classification) of these classes. This is typically
tackled in a supervised way. Spam filtering is an
example of classification, where the inputs are email
(or other) messages and the classes are “spam” and
“not spam”.

• In regression, also a supervised problem, the outputs 
are continuous rather than discrete. 

• In clustering, a set of inputs is to be divided into
groups. Unlike in classification, the groups are not
known beforehand, making this typically an
unsupervised task.

• Density estimation finds the distribution of inputs in
some space.

• Dimensionality reduction simplifies inputs by mapping
them into a lower-dimensional space. Topic
modeling is a related problem, where a program is
given a list of human language documents and is
tasked to find out which documents cover similar
topics.
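
As a rough illustration of the classification task described above,
the following minimal sketch assigns an unseen input to the class of
its nearest training example. The nearest-neighbour rule, the two
numeric features and the toy messages are all assumptions chosen for
illustration; they are not taken from the text.

```python
import math

def nearest_neighbor_classify(train, query):
    """Return the label of the training point closest to `query`."""
    best_label, best_dist = None, math.inf
    for features, label in train:
        dist = math.dist(features, query)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical toy data: (number of links, number of exclamation marks).
train = [
    ((0, 0), "not spam"),
    ((1, 0), "not spam"),
    ((7, 5), "spam"),
    ((9, 8), "spam"),
]

print(nearest_neighbor_classify(train, (8, 6)))
print(nearest_neighbor_classify(train, (0, 1)))
```

The labelled pairs play the role of the "teacher" in supervised
learning; a clustering method would receive the same points without
the labels.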


1.2 History and relationships to other fields

As a scientific endeavour, machine learning grew out
of the quest for artificial intelligence. Already in the
early days of AI as an academic discipline, some
researchers were interested in having machines learn from
data. They attempted to approach the problem with various
symbolic methods, as well as what were then termed
“neural networks”; these were mostly perceptrons and
other models that were later found to be reinventions of
the generalized linear models of statistics. Probabilistic
reasoning was also employed, especially in automated
medical diagnosis.[10]:488

However, an increasing emphasis on the logical,
knowledge-based approach caused a rift between AI and
machine learning. Probabilistic systems were plagued
by theoretical and practical problems of data acquisition
and representation.[10]:488 By 1980, expert systems had
come to dominate AI, and statistics was out of favor.[11]
Work on symbolic/knowledge-based learning did continue
within AI, leading to inductive logic programming,
but the more statistical line of research was now outside
the field of AI proper, in pattern recognition and
information retrieval.[10]:708-710; 753 Neural networks
research had been abandoned by AI and computer science
around the same time. This line, too, was continued outside
the AI/CS field, as “connectionism”, by researchers
from other disciplines including Hopfield, Rumelhart and
Hinton. Their main success came in the mid-1980s with
the reinvention of backpropagation.[10]:25

Machine learning, reorganized as a separate field, started
to flourish in the 1990s. The field changed its goal from
achieving artificial intelligence to tackling solvable
problems of a practical nature. It shifted focus away from
the symbolic approaches it had inherited from AI, and
toward methods and models borrowed from statistics and
probability theory.[11] It also benefited from the
increasing availability of digitized information, and the
possibility of distributing it via the Internet.

Machine learning and data mining often employ the same 
methods and overlap significantly. They can be roughly 
distinguished as follows: 

• Machine learning focuses on prediction, based on 
known properties learned from the training data. 

• Data mining focuses on the discovery of (previously) 
unknown properties in the data. This is the analysis 
step of Knowledge Discovery in Databases. 

The two areas overlap in many ways: data mining uses
many machine learning methods, but often with a slightly
different goal in mind. On the other hand, machine
learning also employs data mining methods as
“unsupervised learning” or as a preprocessing step to improve
learner accuracy. Much of the confusion between these
two research communities (which do often have separate
conferences and separate journals, ECML PKDD
being a major exception) comes from the basic assumptions
they work with: in machine learning, performance
is usually evaluated with respect to the ability to
reproduce known knowledge, while in Knowledge Discovery
and Data Mining (KDD) the key task is the discovery
of previously unknown knowledge. Evaluated with
respect to known knowledge, an uninformed (unsupervised)
method will easily be outperformed by supervised
methods, while in a typical KDD task, supervised methods
cannot be used due to the unavailability of training
data.

Machine learning also has intimate ties to optimization:
many learning problems are formulated as minimization
of some loss function on a training set of examples. Loss
functions express the discrepancy between the predictions
of the model being trained and the actual problem
instances (for example, in classification, one wants
to assign a label to instances, and models are trained
to correctly predict the pre-assigned labels of a set of
examples). The difference between the two fields arises
from the goal of generalization: while optimization
algorithms can minimize the loss on a training set, machine
learning is concerned with minimizing the loss on unseen
samples.[12]
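
The distinction between training loss and loss on unseen samples can
be made concrete with a small sketch. It assumes a one-parameter
model y = w * x, a mean squared error loss and plain gradient descent;
the data points, learning rate and step count are hypothetical values
chosen only for illustration.

```python
def mse(w, data):
    """Mean squared error of the model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def fit(data, lr=0.01, steps=500):
    """Minimize the training loss by gradient descent on w."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

# Hypothetical data: the model only ever sees `train`.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
held_out = [(5.0, 10.1), (6.0, 11.8)]

w = fit(train)
print("learned w:", w)
print("training loss:", mse(w, train))
print("held-out loss:", mse(w, held_out))
```

The optimizer only reduces the first loss; the second one, computed
on samples never used for fitting, is the quantity that
generalization is concerned with.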

1.2.1 Relation to statistics 

Machine learning and statistics are closely related fields.
According to Michael I. Jordan, the ideas of machine
learning, from methodological principles to theoretical
tools, have had a long pre-history in statistics.[13] He also
suggested the term data science as a placeholder name for
the overall field.[13]

Leo Breiman distinguished two statistical modelling
paradigms: data model and algorithmic model,[14]
wherein “algorithmic model” means more or less
machine learning algorithms such as random forests.

Some statisticians have adopted methods from machine
learning, leading to a combined field that they call
statistical learning.[15]


1.3 Theory 

Main article: Computational learning theory 

A core objective of a learner is to generalize from its
experience.[3][16] Generalization in this context is the
ability of a learning machine to perform accurately on new,
unseen examples/tasks after having experienced a learning
data set. The training examples come from some generally
unknown probability distribution (considered
representative of the space of occurrences) and the learner
has to build a general model about this space that enables
it to produce sufficiently accurate predictions in new
cases.

The computational analysis of machine learning algorithms
and their performance is a branch of theoretical
computer science known as computational learning theory.
Because training sets are finite and the future is
uncertain, learning theory usually does not yield guarantees
of the performance of algorithms. Instead, probabilistic
bounds on the performance are quite common. The
bias-variance decomposition is one way to quantify
generalization error.
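
For reference, one common form of the bias-variance decomposition is
sketched below. It assumes squared-error loss and data generated as
y = f(x) + ε with zero-mean noise of variance σ²; the symbols f, f̂,
ε and σ are notation introduced here, not taken from the text, and the
expectation is over training sets and noise.

```latex
\mathbb{E}\!\left[\bigl(y - \hat{f}(x)\bigr)^2\right]
  = \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```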

In addition to performance bounds, computational learning
theorists study the time complexity and feasibility of
learning. In computational learning theory, a computation
is considered feasible if it can be done in polynomial
time. There are two kinds of time complexity results.
Positive results show that a certain class of functions can
be learned in polynomial time. Negative results show that
certain classes cannot be learned in polynomial time.

There are many similarities between machine learning
theory and statistical inference, although they use
different terms.


1.4 Approaches 

Main article: List of machine learning algorithms 


1.4.1 Decision tree learning 

Main article: Decision tree learning 

Decision tree learning uses a decision tree as a predictive
model, which maps observations about an item to
conclusions about the item’s target value.
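
The mapping from observations to a target value can be sketched as a
walk down a tree. The tree below is written by hand purely for
illustration (its features, branches and outcomes are hypothetical);
a decision tree learner would construct such a structure from
training data, which this sketch does not show.

```python
# Internal nodes are dicts that test one feature; leaves are target values.
tree = {
    "feature": "outlook",
    "branches": {
        "sunny": {
            "feature": "windy",
            "branches": {True: "stay in", False: "play"},
        },
        "rainy": "stay in",
        "overcast": "play",
    },
}

def predict(node, item):
    """Follow the branch matching each tested feature until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][item[node["feature"]]]
    return node

print(predict(tree, {"outlook": "sunny", "windy": False}))
print(predict(tree, {"outlook": "rainy", "windy": True}))
```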

1.4.2 Association rule learning 

Main article: Association rule learning 

Association rule learning is a method for discovering
interesting relations between variables in large databases.
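
One way to make "interesting relations" concrete is to score a
candidate rule by its support and confidence, two measures commonly
used in this area but not named in the text. The sketch below
computes both for a hypothetical set of shopping transactions.

```python
# Hypothetical database: each transaction is a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent`."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "butter"}))
print(confidence({"bread"}, {"butter"}))
```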

1.4.3 Artificial neural networks 

Main article: Artificial neural network 

An artificial neural network (ANN) learning algorithm,
usually called “neural network” (NN), is a learning
algorithm that is inspired by the structure and
functional aspects of biological neural networks.
Computations are structured in terms of an interconnected
group of artificial neurons, processing information using
a connectionist approach to computation. Modern neural
networks are non-linear statistical data modeling tools.
They are usually used to model complex relationships
between inputs and outputs, to find patterns in data, or to
capture the statistical structure in an unknown joint
probability distribution between observed variables.
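
A minimal sketch of interconnected artificial neurons follows. It
assumes threshold (step) activations and hand-picked weights that
make a two-layer network compute XOR, a function no single linear
unit can represent; a learning algorithm would normally set the
weights from data, which is not shown here.

```python
def step(z):
    """Threshold activation: the neuron fires when its input is positive."""
    return 1 if z > 0 else 0

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs followed by the activation."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def xor_network(x1, x2):
    # Hidden layer: one unit behaves like OR, the other like AND.
    h_or = neuron((x1, x2), (1, 1), -0.5)
    h_and = neuron((x1, x2), (1, 1), -1.5)
    # Output unit fires when OR is on but AND is off.
    return neuron((h_or, h_and), (1, -1), -0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```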

1.4.4 Inductive logic programming 

Main article: Inductive logic programming 

Inductive logic programming (ILP) is an approach to rule
learning using logic programming as a uniform
representation for input examples, background knowledge, and
hypotheses. Given an encoding of the known background
knowledge and a set of examples represented as a logical
database of facts, an ILP system will derive a
hypothesized logic program that entails all positive and no
negative examples. Inductive programming is a related
field that considers any kind of programming language
for representing hypotheses (and not only logic
programming), such as functional programs.

1.4.5 Support vector machines 

Main article: Support vector machines 

Support vector machines (SVMs) are a set of related
supervised learning methods used for classification and
regression. Given a set of training examples, each marked
as belonging to one of two categories, an SVM training
algorithm builds a model that predicts whether a new
example falls into one category or the other.

1.4.6 Clustering 

Main article: Cluster analysis 

Cluster analysis is the assignment of a set of observations
into subsets (called clusters) so that observations within
the same cluster are similar according to some
predesignated criterion or criteria, while observations drawn
from different clusters are dissimilar. Different clustering
techniques make different assumptions on the structure
of the data, often defined by some similarity metric
and evaluated for example by internal compactness
(similarity between members of the same cluster) and
separation between different clusters. Other methods are based
on estimated density and graph connectivity. Clustering is
a method of unsupervised learning, and a common
technique for statistical data analysis.
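
As one concrete instance of clustering by internal compactness, the
sketch below runs a simple k-means procedure on one-dimensional data.
The choice of k-means, the absolute-difference distance, the data and
the starting centers are assumptions made for illustration only.

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center
    and moving each center to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)
print(clusters)
```

No labels are supplied: the two groups emerge only from the distances
between the points.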


1.4.7 Bayesian networks 

Main article: Bayesian network 

A Bayesian network, belief network or directed acyclic
graphical model is a probabilistic graphical model that
represents a set of random variables and their conditional
independencies via a directed acyclic graph (DAG). For
example, a Bayesian network could represent the
probabilistic relationships between diseases and symptoms.
Given symptoms, the network can be used to compute
the probabilities of the presence of various diseases.
Efficient algorithms exist that perform inference and
learning.
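
The disease and symptom example can be worked through for the
smallest possible network, a single edge from one disease to one
symptom. The probabilities below are invented for illustration; the
computation is ordinary Bayes' rule, which is what inference reduces
to in this two-node case.

```python
def posterior_disease(prior, p_symptom_given_disease, p_symptom_given_healthy):
    """P(disease | symptom observed) for a two-node network disease -> symptom."""
    joint_disease = prior * p_symptom_given_disease
    joint_healthy = (1 - prior) * p_symptom_given_healthy
    return joint_disease / (joint_disease + joint_healthy)

# Hypothetical numbers: a rare disease and a fairly specific symptom.
p = posterior_disease(
    prior=0.01,
    p_symptom_given_disease=0.9,
    p_symptom_given_healthy=0.05,
)
print(round(p, 4))
```

Even with a symptom that is 18 times more likely under the disease,
the posterior stays well below one half because the disease is rare.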

1.4.8 Reinforcement learning 

Main article: Reinforcement learning 

Reinforcement learning is concerned with how an agent
ought to take actions in an environment so as to maximize
some notion of long-term reward. Reinforcement
learning algorithms attempt to find a policy that maps
states of the world to the actions the agent ought to take
in those states. Reinforcement learning differs from the
supervised learning problem in that correct input/output
pairs are never presented, nor sub-optimal actions
explicitly corrected.
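
A small sketch of learning a policy from reward alone is given below.
It assumes tabular Q-learning, one of several reinforcement learning
algorithms and not one named in the text, on a hypothetical
five-state corridor where only reaching the rightmost state is
rewarded. The step size, discount factor and episode count are
arbitrary illustrative choices.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal (toy world)
ACTIONS = (-1, +1)    # move left or move right

def step(state, action):
    """Deterministic toy environment: reward 1 only on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def q_learning(episodes=300, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action index]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            a = rng.randrange(2)  # explore uniformly at random
            next_state, reward = step(state, ACTIONS[a])
            # Move the estimate toward reward plus discounted best future value.
            target = reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q = q_learning()
# Greedy policy: the best-valued action in each non-goal state.
policy = [ACTIONS[row.index(max(row))] for row in q[:-1]]
print(policy)
```

The agent is never told which action was correct in a given state; it
only observes the reward, and the policy is read off the learned
values afterwards.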

1.4.9 Representation learning 

Main article: Representation learning 

Several learning algorithms, mostly unsupervised learning
algorithms, aim at discovering better representations
of the inputs provided during training. Classical examples
include principal components analysis and cluster
analysis. Representation learning algorithms often attempt
to preserve the information in their input but transform
it in a way that makes it useful, often as a
pre-processing step before performing classification or
predictions. This allows the inputs coming from the unknown
data generating distribution to be reconstructed, while not
necessarily being faithful for configurations that are
implausible under that distribution.

Manifold learning algorithms attempt to do so under
the constraint that the learned representation is
low-dimensional. Sparse coding algorithms attempt to do
so under the constraint that the learned representation is
sparse (has many zeros). Multilinear subspace learning
algorithms aim to learn low-dimensional representations
directly from tensor representations for multidimensional
data, without reshaping them into (high-dimensional)
vectors.[17] Deep learning algorithms discover multiple
levels of representation, or a hierarchy of features, with
higher-level, more abstract features defined in terms of
(or generating) lower-level features. It has been argued
that an intelligent machine is one that learns a
representation that disentangles the underlying factors of
variation that explain the observed data.[18]


1.4.10 Similarity and metric learning 

Main article: Similarity learning 

In this problem, the learning machine is given pairs of
examples that are considered similar and pairs of less
similar objects. It then needs to learn a similarity function
(or a distance metric function) that can predict if new
objects are similar. It is sometimes used in recommendation
systems.


1.4.11 Sparse dictionary learning 

In this method, a datum is represented as a linear
combination of basis functions, and the coefficients are
assumed to be sparse. Let x be a d-dimensional datum, D
be a d by n matrix, where each column of D represents
a basis function, and r be the coefficient vector that
represents x using D. Mathematically, sparse dictionary
learning means solving x ≈ Dr where r is sparse. Generally
speaking, n is assumed to be larger than d to allow the
freedom for a sparse representation.
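
The role of n > d can be seen in a tiny numeric sketch with d = 2 and
n = 3. The dictionary D and the datum x below are hypothetical; both
coefficient vectors reproduce x exactly, but the second uses a single
basis function.

```python
def matvec(D, r):
    """Multiply a d-by-n matrix (given as a list of rows) by an n-vector."""
    return [sum(d_ij * r_j for d_ij, r_j in zip(row, r)) for row in D]

def nonzeros(r):
    return sum(1 for value in r if value != 0)

# The columns of D are the basis functions (1, 0), (0, 1) and (1, 1).
D = [[1, 0, 1],
     [0, 1, 1]]
x = [2, 2]

dense_r = [2, 2, 0]    # uses two basis functions
sparse_r = [0, 0, 2]   # uses one basis function

print(matvec(D, dense_r), nonzeros(dense_r))
print(matvec(D, sparse_r), nonzeros(sparse_r))
```

With only the first two columns (n = d), the dense representation
would be the only option; the extra column is what leaves room for a
sparser one.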

Learning a dictionary along with sparse representations
is strongly NP-hard and also difficult to solve
approximately.[19] A popular heuristic method for sparse
dictionary learning is K-SVD.

Sparse dictionary learning has been applied in several
contexts. In classification, the problem is to determine
which classes a previously unseen datum belongs to.
Suppose a dictionary for each class has already been built.
Then a new datum is associated with the class whose
dictionary gives it the best sparse representation.
Sparse dictionary learning has also been applied
in image de-noising. The key idea is that a clean image
patch can be sparsely represented by an image dictionary,
but the noise cannot.[20]


1.4.12 Genetic algorithms 

Main article: Genetic algorithm 

A genetic algorithm (GA) is a search heuristic that mimics
the process of natural selection, and uses methods such
as mutation and crossover to generate new genotypes in
the hope of finding good solutions to a given problem. In
machine learning, genetic algorithms found some uses in
the 1980s and 1990s.[21][22] Conversely, machine learning
techniques have been used to improve the performance
of genetic and evolutionary algorithms.[23]
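
The selection, crossover and mutation loop can be sketched on a toy
problem. The example below assumes bit-string genotypes, a fitness
equal to the number of 1 bits, truncation selection and elitism; all
of these are illustrative choices, as are the population size,
mutation rate and seed.

```python
import random

def fitness(genotype):
    """Toy objective: count the 1 bits in the genotype."""
    return sum(genotype)

def evolve(pop_size=20, length=16, generations=40, seed=0):
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)
    ]
    initial_best = max(map(fitness, population))
    for _ in range(generations):
        # Selection: keep the fitter half as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = [parents[0]]  # elitism: the best genotype survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)
            child = a[:cut] + b[cut:]          # one-point crossover
            if rng.random() < 0.2:             # occasional mutation
                i = rng.randrange(length)
                child[i] = 1 - child[i]
            children.append(child)
        population = children
    return initial_best, max(map(fitness, population))

initial_best, final_best = evolve()
print(initial_best, final_best)
```

Because the best genotype is always carried over, the best fitness in
the population can never decrease from one generation to the next.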

1.5 Applications 

Applications for machine learning include: 

• Adaptive websites 

• Affective computing 

• Bioinformatics 

• Brain-machine interfaces 

• Cheminformatics 

• Classifying DNA sequences 

• Computational advertising 

• Computational finance 

• Computer vision, including object recognition 

• Detecting credit card fraud 

• Game playing[24]

• Information retrieval 

• Internet fraud detection 

• Machine perception 

• Medical diagnosis 

• Natural language processing[25]

• Optimization and metaheuristic 

• Recommender systems 

• Robot locomotion 

• Search engines 

• Sentiment analysis (or opinion mining) 

• Sequence mining 

• Software engineering 

• Speech and handwriting recognition 

• Stock market analysis 

• Structural health monitoring 

• Syntactic pattern recognition 

In 2006, the online movie company Netflix held the first
“Netflix Prize” competition to find a program to better
predict user preferences and improve the accuracy of its
existing Cinematch movie recommendation algorithm by
at least 10%. A joint team made up of researchers from
AT&T Labs-Research in collaboration with the teams Big
Chaos and Pragmatic Theory built an ensemble model to
win the Grand Prize in 2009 for $1 million.[26] Shortly
after the prize was awarded, Netflix realized that viewers’
ratings were not the best indicators of their viewing
patterns (“everything is a recommendation”) and they
changed their recommendation engine accordingly.[27]

In 2010 The Wall Street Journal wrote about money
management firm Rebellion Research’s use of machine learning
to predict economic movements. The article describes
Rebellion Research’s prediction of the financial
crisis and economic recovery.[28]

In 2014 it was reported that a machine learning algorithm
had been applied in art history to study fine art
paintings, and that it may have revealed previously
unrecognized influences between artists.[29]

1.6 Software 

Software suites containing a variety of machine learning 
algorithms include the following: 

1.6.1 Open-source software 

• dlib 

• ELKI 

• Encog 

• H2O

• Mahout 

• mlpy 

• MLPACK 

• MOA (Massive Online Analysis) 

• ND4J with Deeplearning4j 

• OpenCV 

• OpenNN 

• Orange 

• R 

• scikit-learn 

• Shogun 

• Torch (machine learning) 


• Spark 

• Yooreeka 

• Weka 

1.6.2 Commercial software with open-source editions

• KNIME 

• RapidMiner 

1.6.3 Commercial software 

• Amazon Machine Learning 

• Angoss KnowledgeSTUDIO 

• Databricks 

• IBM SPSS Modeler 

• KXEN Modeler 

• LIONsolver 

• Mathematica 

• MATLAB 

• Microsoft Azure Machine Learning 

• Neural Designer 

• NeuroSolutions 

• Oracle Data Mining 

• RCASE 

• SAS Enterprise Miner 

• STATISTICA Data Miner 

1.7 Journals 

• Journal of Machine Learning Research 

• Machine Learning 

• Neural Computation 

1.8 Conferences 

• Conference on Neural Information Processing Systems

• International Conference on Machine Learning 

1.9 See also

• Adaptive control 

• Adversarial machine learning 

• Automatic reasoning 

• Cache language model 

• Cognitive model 

• Cognitive science 

• Computational intelligence 

• Computational neuroscience 

• Ethics of artificial intelligence 

• Existential risk of artificial general intelligence 

• Explanation-based learning 

• Hidden Markov model 

• Important publications in machine learning 

• List of machine learning algorithms 

Chapter 2

Artificial intelligence 


“AI” redirects here. For other uses, see Ai and Artificial 
intelligence (disambiguation). 

Artificial intelligence (AI) is the intelligence exhibited
by machines or software. It is also the name of the
academic field that studies how to create computers
and computer software that are capable of intelligent
behavior. Major AI researchers and textbooks define this
field as “the study and design of intelligent agents”,[1] in
which an intelligent agent is a system that perceives its
environment and takes actions that maximize its chances
of success.[2] John McCarthy, who coined the term in
1955,[3] defines it as “the science and engineering of making
intelligent machines”.[4]

AI research is highly technical and specialized, and is
deeply divided into subfields that often fail to communicate
with each other.[5] Some of the division is due
to social and cultural factors: subfields have grown up
around particular institutions and the work of individual
researchers. AI research is also divided by several
technical issues. Some subfields focus on the solution of
specific problems. Others focus on one of several possible
approaches, on the use of a particular tool, or on
the accomplishment of particular applications.

The central problems (or goals) of AI research include
reasoning, knowledge, planning, learning, natural language
processing (communication), perception and the
ability to move and manipulate objects.[6] General
intelligence is still among the field’s long-term goals.[7]
Currently popular approaches include statistical methods,
computational intelligence and traditional symbolic AI.
There are a large number of tools used in AI, including
versions of search and mathematical optimization,
logic, methods based on probability and economics, and
many others. The AI field is interdisciplinary: a
number of sciences and professions converge in it, including
computer science, mathematics, psychology, linguistics,
philosophy and neuroscience, as well as other specialized
fields such as artificial psychology.

The field was founded on the claim that a central property
of humans, intelligence (the sapience of Homo sapiens),
“can be so precisely described that a machine can
be made to simulate it.”[8] This raises philosophical issues
about the nature of the mind and the ethics of creating
artificial beings endowed with human-like intelligence,
issues which have been addressed by myth, fiction
and philosophy since antiquity.[9] Artificial intelligence
has been the subject of tremendous optimism[10] but has
also suffered stunning setbacks.[11] Today it has become
an essential part of the technology industry, providing the
heavy lifting for many of the most challenging problems
in computer science.[12]


2.1 History 

Main articles: History of artificial intelligence and 
Timeline of artificial intelligence 

Thinking machines and artificial beings appear in Greek
myths, such as Talos of Crete, the bronze robot of
Hephaestus, and Pygmalion’s Galatea.[13] Human likenesses
believed to have intelligence were built in every
major civilization: animated cult images were worshiped
in Egypt and Greece[14] and humanoid automatons
were built by Yan Shi, Hero of Alexandria and
Al-Jazari.[15] It was also widely believed that artificial
beings had been created by Jabir ibn Hayyan, Judah Loew
and Paracelsus.[16] By the 19th and 20th centuries,
artificial beings had become a common feature in fiction, as
in Mary Shelley's Frankenstein or Karel Čapek's R.U.R.
(Rossum’s Universal Robots).[17] Pamela McCorduck argues
that all of these are examples of an ancient
urge, as she describes it, “to forge the gods”.[9] Stories of
these creatures and their fates discuss many of the same
hopes, fears and ethical concerns that are presented by
artificial intelligence.

Mechanical or “formal” reasoning has been developed
by philosophers and mathematicians since antiquity.
The study of logic led directly to the invention of the
programmable digital electronic computer, based on the
work of mathematician Alan Turing and others. Turing’s
theory of computation suggested that a machine, by
shuffling symbols as simple as “0” and “1”, could simulate
any conceivable act of mathematical deduction.[18][19]
This, along with concurrent discoveries in neurology,
information theory and cybernetics, inspired a small
group of researchers to begin to seriously consider the
possibility of building an electronic brain.[20]

The field of AI research was founded at a conference
on the campus of Dartmouth College in the summer
of 1956.[21] The attendees, including John McCarthy,
Marvin Minsky, Allen Newell, Arthur Samuel, and
Herbert Simon, became the leaders of AI research for
many decades.[22] They and their students wrote programs
that were, to most people, simply astonishing:[23]
computers were winning at checkers, solving word problems
in algebra, proving logical theorems and speaking
English.[24] By the middle of the 1960s, research in
the U.S. was heavily funded by the Department of
Defense[25] and laboratories had been established around the
world.[26] AI’s founders were profoundly optimistic about
the future of the new field: Herbert Simon predicted that
“machines will be capable, within twenty years, of doing
any work a man can do” and Marvin Minsky agreed, writing
that “within a generation ... the problem of creating
'artificial intelligence' will substantially be solved”.[27]

They had failed to recognize the difficulty of some of the
problems they faced.[28] In 1974, in response to the
criticism of Sir James Lighthill[29] and ongoing pressure from
the US Congress to fund more productive projects, both
the U.S. and British governments cut off all undirected
exploratory research in AI. The next few years would later
be called an “AI winter”,[30] a period when funding for AI
projects was hard to find.

In the early 1980s, AI research was revived by the
commercial success of expert systems,[31] a form of AI
program that simulated the knowledge and analytical skills
of one or more human experts. By 1985 the market for
AI had reached over a billion dollars. At the same time,
Japan’s fifth generation computer project inspired the U.S.
and British governments to restore funding for academic
research in the field.[32] However, beginning with the
collapse of the Lisp Machine market in 1987, AI once again
fell into disrepute, and a second, longer lasting AI winter
began.[33]

In the 1990s and early 21st century, AI achieved its
greatest successes, albeit somewhat behind the scenes.
Artificial intelligence is used for logistics, data mining,
medical diagnosis and many other areas throughout the
technology industry.[12] The success was due to several factors:
the increasing computational power of computers (see
Moore’s law), a greater emphasis on solving specific
subproblems, the creation of new ties between AI and other
fields working on similar problems, and a new commitment
by researchers to solid mathematical methods and
rigorous scientific standards.[34]

On 11 May 1997, Deep Blue became the first computer
chess-playing system to beat a reigning world chess
champion, Garry Kasparov.[35] In February 2011, in a
Jeopardy! quiz show exhibition match, IBM's question
answering system, Watson, defeated the two greatest
Jeopardy champions, Brad Rutter and Ken Jennings, by
a significant margin.[36] The Kinect, which provides a
3D body-motion interface for the Xbox 360 and the
Xbox One, uses algorithms that emerged from lengthy
AI research,[37] as do intelligent personal assistants in
smartphones.[38]


2.2 Research 

2.2.1 Goals 

You awake one morning to find your brain 
has another lobe functioning. Invisible, this 
auxiliary lobe answers your questions with 
information beyond the realm of your own 
memory, suggests plausible courses of action, 
and asks questions that help bring out relevant 
facts. You quickly come to rely on the new 
lobe so much that you stop wondering how it 
works. You just use it. This is the dream of 
artificial intelligence. 

—BYTE, April 1985 [39] 


The general problem of simulating (or creating)
intelligence has been broken down into a number of specific
sub-problems. These consist of particular traits or
capabilities that researchers would like an intelligent system
to display. The traits described below have received the
most attention. [6]

Deduction, reasoning, problem solving 

Early AI researchers developed algorithms that imitated
the step-by-step reasoning that humans use when they
solve puzzles or make logical deductions. [40] By the late
1980s and 1990s, AI research had also developed highly
successful methods for dealing with uncertain or
incomplete information, employing concepts from probability
and economics. [41]

For difficult problems, most of these algorithms can
require enormous computational resources: most
experience a "combinatorial explosion", in which the amount
of memory or computer time required becomes astronomical
when the problem goes beyond a certain size. The search for
more efficient problem-solving algorithms is a high
priority for AI research. [42]

Human beings solve most of their problems using fast,
intuitive judgements rather than the conscious,
step-by-step deduction that early AI research was able to
model. [43] AI has made some progress at imitating this
kind of “sub-symbolic” problem solving: embodied agent
approaches emphasize the importance of sensorimotor
skills to higher reasoning; neural net research attempts
to simulate the structures inside the brain that give rise to
this skill; statistical approaches to AI mimic the
probabilistic nature of the human ability to guess.

Knowledge representation 



An ontology represents knowledge as a set of concepts within a 
domain and the relationships between those concepts. 

Main articles: Knowledge representation and Commonsense knowledge

Knowledge representation [44] and knowledge engineering [45]
are central to AI research. Many of the problems
machines are expected to solve will require extensive
knowledge about the world. Among the things that AI
needs to represent are: objects, properties, categories and
relations between objects; [46] situations, events, states and
time; [47] causes and effects; [48] knowledge about knowledge
(what we know about what other people know); [49]
and many other, less well researched domains. A
representation of “what exists” is an ontology: the set of
objects, relations, concepts and so on that the machine
knows about. The most general are called upper ontologies,
which attempt to provide a foundation for all other
knowledge. [50]

Among the most difficult problems in knowledge repre¬ 
sentation are: 

Default reasoning and the qualification problem 

Many of the things people know take the form
of “working assumptions.” For example, if a bird
comes up in conversation, people typically picture
an animal that is fist-sized, sings, and flies. None
of these things are true about all birds. John
McCarthy identified this problem in 1969 [51] as the
qualification problem: for any commonsense rule
that AI researchers care to represent, there tend to
be a huge number of exceptions. Almost nothing
is simply true or false in the way that abstract logic
requires. AI research has explored a number of
solutions to this problem. [52]

The breadth of commonsense knowledge

The number of atomic facts that the average person knows
is astronomical. Research projects that attempt to
build a complete knowledge base of commonsense
knowledge (e.g., Cyc) require enormous amounts
of laborious ontological engineering; they must
be built, by hand, one complicated concept at a
time. [53] A major goal is to have the computer
understand enough concepts to be able to learn by
reading from sources like the internet, and thus be
able to add to its own ontology.

The subsymbolic form of some commonsense knowledge 

Much of what people know is not represented as
“facts” or “statements” that they could express
verbally. For example, a chess master will avoid
a particular chess position because it “feels too
exposed” [54] or an art critic can take one look at a
statue and instantly realize that it is a fake. [55] These
are intuitions or tendencies that are represented in
the brain non-consciously and sub-symbolically. [56]
Knowledge like this informs, supports and provides
a context for symbolic, conscious knowledge. As
with the related problem of sub-symbolic reasoning,
it is hoped that situated AI, computational
intelligence, or statistical AI will provide ways to
represent this kind of knowledge. [56]

Planning 


A hierarchical control system is a form of control system in which
a set of devices and governing software is arranged in a hierarchy.

Main article: Automated planning and scheduling 

Intelligent agents must be able to set goals and achieve
them. [57] They need a way to visualize the future (they
must have a representation of the state of the world and
be able to make predictions about how their actions will
change it) and be able to make choices that maximize the
utility (or “value”) of the available choices. [58]

In classical planning problems, the agent can assume that
it is the only thing acting on the world and it can be certain
what the consequences of its actions may be. [59] However,
if the agent is not the only actor, it must periodically
ascertain whether the world matches its predictions and it
must change its plan as this becomes necessary, requiring
the agent to reason under uncertainty. [60]

Multi-agent planning uses the cooperation and competition
of many agents to achieve a given goal. Emergent
behavior such as this is used by evolutionary algorithms
and swarm intelligence. [61]

Learning 

Main article: Machine learning 

Machine learning is the study of computer algorithms that
improve automatically through experience [62][63] and has
been central to AI research since the field’s inception. [64]

Unsupervised learning is the ability to find patterns in
a stream of input. Supervised learning includes both
classification and numerical regression. Classification is
used to determine what category something belongs in,
after seeing a number of examples of things from several
categories. Regression is the attempt to produce a function
that describes the relationship between inputs and
outputs and predicts how the outputs should change as the
inputs change. In reinforcement learning [65] the agent is
rewarded for good responses and punished for bad ones.
The agent uses this sequence of rewards and punishments
to form a strategy for operating in its problem space.
These three types of learning can be analyzed in terms of
decision theory, using concepts like utility. The
mathematical analysis of machine learning algorithms and their
performance is a branch of theoretical computer science
known as computational learning theory. [66]
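
As a small illustration of regression as described above, the following sketch fits a straight line to example input/output pairs by ordinary least squares and then uses it to predict a new output. The function name fit_line and the training data are assumptions made for this example; least squares is only one of many ways to produce such a function.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalised covariance of x and y, and unnormalised variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training examples that happen to lie on y = 2x + 1.
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])


def predict(x):
    """Predict the output for a new input using the fitted line."""
    return slope * x + intercept
```

The fitted function answers the question in the text: how the output should change as the input changes (here, by the value of slope per unit of input).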

Within developmental robotics, developmental learning
approaches were elaborated for lifelong cumulative
acquisition of repertoires of novel skills by a robot, through
autonomous self-exploration and social interaction with
human teachers, and using guidance mechanisms such
as active learning, maturation, motor synergies, and
imitation. [67][68][69][70]


Natural language processing (communication) 

Main article: Natural language processing 

A parse tree represents the syntactic structure of a sentence
according to some formal grammar.

Natural language processing [71] gives machines the
ability to read and understand the languages that humans
speak. A sufficiently powerful natural language processing
system would enable natural language user interfaces
and the acquisition of knowledge directly from
human-written sources, such as newswire texts. Some
straightforward applications of natural language processing include
information retrieval (or text mining), question
answering [72] and machine translation. [73]

A common method of processing and extracting meaning
from natural language is through semantic indexing.
Increases in processing speeds and the drop in the cost
of data storage make indexing large volumes of
abstractions of the user’s input much more efficient.

Perception 

Main articles: Machine perception, Computer vision and
Speech recognition

Machine perception [74] is the ability to use input from
sensors (such as cameras, microphones, tactile sensors,
sonar and others more exotic) to deduce aspects of the
world. Computer vision [75] is the ability to analyze visual
input. A few selected subproblems are speech
recognition, [76] facial recognition and object recognition. [77]

Motion and manipulation 

Main article: Robotics 

The field of robotics [78] is closely related to AI.
Intelligence is required for robots to be able to handle such
tasks as object manipulation [79] and navigation, with
sub-problems of localization (knowing where you are, or
finding out where other things are), mapping (learning
what is around you, building a map of the environment),
and motion planning (figuring out how to get there) or
path planning (going from one point in space to another
point, which may involve compliant motion, where the
robot moves while maintaining physical contact with an
object). [80][81]


Long-term goals 

Among the long-term goals in the research pertaining to
artificial intelligence are: (1) Social intelligence, (2)
Creativity, and (3) General intelligence.


Social intelligence

Main article: Affective computing

Kismet, a robot with rudimentary social skills

Affective computing is the study and development of
systems and devices that can recognize, interpret,
process, and simulate human affects. [82][83] It is an
interdisciplinary field spanning computer sciences, psychology,
and cognitive science. [84] While the origins of the field
may be traced as far back as to early philosophical
inquiries into emotion, [85] the more modern branch of
computer science originated with Rosalind Picard's 1995
paper [86] on affective computing. [87][88] A motivation for
the research is the ability to simulate empathy. The
machine should interpret the emotional state of humans and
adapt its behaviour to them, giving an appropriate
response for those emotions.

Emotion and social skills [89] play two roles for an
intelligent agent. First, it must be able to predict the actions
of others, by understanding their motives and emotional
states. (This involves elements of game theory, decision
theory, as well as the ability to model human emotions
and the perceptual skills to detect emotions.) Also, in an
effort to facilitate human-computer interaction, an
intelligent machine might want to be able to display emotions,
even if it does not actually experience them itself, in
order to appear sensitive to the emotional dynamics of
human interaction.


Creativity

Main article: Computational creativity

A sub-field of AI addresses creativity both theoretically
(from a philosophical and psychological perspective) and
practically (via specific implementations of systems that
generate outputs that can be considered creative, or
systems that identify and assess creativity). Related areas of
computational research are artificial intuition and
artificial thinking.


General intelligence

Main articles: Artificial general intelligence and AI-complete

Many researchers think that their work will eventually
be incorporated into a machine with general intelligence
(known as strong AI), combining all the skills above and
exceeding human abilities at most or all of them. [7] A few
believe that anthropomorphic features like artificial
consciousness or an artificial brain may be required for such
a project. [90][91]

Many of the problems above may require general
intelligence to be considered solved. For example, even
a straightforward, specific task like machine translation
requires that the machine read and write in both
languages (NLP), follow the author’s argument (reason),
know what is being talked about (knowledge), and
faithfully reproduce the author’s intention (social intelligence).
A problem like machine translation is considered
"AI-complete": in order to solve this particular problem, you
must solve all the problems. [92]


2.2.2 Approaches 

There is no established unifying theory or paradigm that
guides AI research. Researchers disagree about many
issues. [93] A few of the most long standing questions
that have remained unanswered are these: should
artificial intelligence simulate natural intelligence by studying
psychology or neurology? Or is human biology as
irrelevant to AI research as bird biology is to aeronautical
engineering? [94] Can intelligent behavior be described using
simple, elegant principles (such as logic or optimization)?
Or does it necessarily require solving a large
number of completely unrelated problems? [95] Can
intelligence be reproduced using high-level symbols, similar
to words and ideas? Or does it require “sub-symbolic”
processing? [96] John Haugeland, who coined the term
GOFAI (Good Old-Fashioned Artificial Intelligence),
also proposed that AI should more properly be referred to
as synthetic intelligence, [97] a term which has since been
adopted by some non-GOFAI researchers. [98][99]







Cybernetics and brain simulation 

Main articles: Cybernetics and Computational neuroscience

In the 1940s and 1950s, a number of researchers explored 
the connection between neurology, information theory, 
and cybernetics. Some of them built machines that used 
electronic networks to exhibit rudimentary intelligence, 
such as W. Grey Walter's turtles and the Johns Hopkins 
Beast. Many of these researchers gathered for meetings 
of the Teleological Society at Princeton University and 
the Ratio Club in England. [20] By 1960, this approach was
largely abandoned, although elements of it would be
revived in the 1980s.

Symbolic 

Main article: Symbolic AI 

When access to digital computers became possible in the
middle 1950s, AI research began to explore the
possibility that human intelligence could be reduced to
symbol manipulation. The research was centered in three
institutions: Carnegie Mellon University, Stanford and
MIT, and each one developed its own style of research.
John Haugeland named these approaches to AI “good old
fashioned AI” or "GOFAI". [100] During the 1960s,
symbolic approaches achieved great success at simulating
high-level thinking in small demonstration programs.
Approaches based on cybernetics or neural networks
were abandoned or pushed into the background. [101]
Researchers in the 1960s and the 1970s were convinced that
symbolic approaches would eventually succeed in creating
a machine with artificial general intelligence and
considered this the goal of their field.

Cognitive simulation

Economist Herbert Simon and
Allen Newell studied human problem-solving skills
and attempted to formalize them, and their work
laid the foundations of the field of artificial
intelligence, as well as cognitive science, operations
research and management science. Their research
team used the results of psychological experiments
to develop programs that simulated the techniques
that people used to solve problems. This tradition,
centered at Carnegie Mellon University, would
eventually culminate in the development of the Soar
architecture in the middle 1980s. [102][103]

Logic-based

Unlike Newell and Simon, John McCarthy
felt that machines did not need to simulate human
thought, but should instead try to find the essence
of abstract reasoning and problem solving,
regardless of whether people used the same algorithms. [94]
His laboratory at Stanford (SAIL) focused on using
formal logic to solve a wide variety of problems,
including knowledge representation, planning and
learning. [104] Logic was also the focus of the work
at the University of Edinburgh and elsewhere in
Europe which led to the development of the
programming language Prolog and the science of logic
programming. [105]

“Anti-logic” or “scruffy”

Researchers at MIT (such as
Marvin Minsky and Seymour Papert) [106] found that
solving difficult problems in vision and natural
language processing required ad-hoc solutions; they
argued that there was no simple and general
principle (like logic) that would capture all the
aspects of intelligent behavior. Roger Schank
described their “anti-logic” approaches as "scruffy"
(as opposed to the "neat" paradigms at CMU and
Stanford). [95] Commonsense knowledge bases (such
as Doug Lenat's Cyc) are an example of “scruffy”
AI, since they must be built by hand, one
complicated concept at a time. [107]

Knowledge-based

When computers with large
memories became available around 1970, researchers
from all three traditions began to build knowledge
into AI applications. [108] This “knowledge
revolution” led to the development and deployment
of expert systems (introduced by Edward
Feigenbaum), the first truly successful form of AI
software. [31] The knowledge revolution was also
driven by the realization that enormous amounts of
knowledge would be required by many simple AI
applications.

Sub-symbolic 

By the 1980s progress in symbolic AI seemed to stall and 
many believed that symbolic systems would never be able 
to imitate all the processes of human cognition, especially 
perception, robotics, learning and pattern recognition. A 
number of researchers began to look into “sub-symbolic” 
approaches to specific AI problems. [96]

Bottom-up, embodied, situated, behavior-based or nouvelle AI

Researchers from the related field of robotics, such
as Rodney Brooks, rejected symbolic AI and
focused on the basic engineering problems that
would allow robots to move and survive. [109] Their
work revived the non-symbolic viewpoint of the
early cybernetics researchers of the 1950s and
reintroduced the use of control theory in AI. This
coincided with the development of the embodied
mind thesis in the related field of cognitive science:
the idea that aspects of the body (such as movement,
perception and visualization) are required for higher
intelligence.

Computational intelligence and soft computing 

Interest in neural networks and "connectionism"
was revived by David Rumelhart and others in the
middle 1980s. [110] Neural networks are an example
of soft computing: they are solutions to problems
which cannot be solved with complete logical
certainty, and where an approximate solution is often
enough. Other soft computing approaches to AI
include fuzzy systems, evolutionary computation and
many statistical tools. The application of soft
computing to AI is studied collectively by the emerging
discipline of computational intelligence. [111]

Statistical 

In the 1990s, AI researchers developed sophisticated
mathematical tools to solve specific subproblems. These
tools are truly scientific, in the sense that their results
are both measurable and verifiable, and they have been
responsible for many of AI’s recent successes. The
shared mathematical language has also permitted a high
level of collaboration with more established fields (like
mathematics, economics or operations research). Stuart
Russell and Peter Norvig describe this movement as
nothing less than a “revolution” and “the victory of the
neats.” [34] Critics argue that these techniques (with few
exceptions [112]) are too focused on particular problems
and have failed to address the long-term goal of general
intelligence. [113] There is an ongoing debate about the
relevance and validity of statistical approaches in AI,
exemplified in part by exchanges between Peter Norvig and
Noam Chomsky. [114][115]

Integrating the approaches 

Intelligent agent paradigm

An intelligent agent is a
system that perceives its environment and takes
actions which maximize its chances of success. The
simplest intelligent agents are programs that solve
specific problems. More complicated agents include
human beings and organizations of human beings
(such as firms). The paradigm gives researchers
license to study isolated problems and find solutions
that are both verifiable and useful, without
agreeing on one single approach. An agent that solves a
specific problem can use any approach that works:
some agents are symbolic and logical, some are
sub-symbolic neural networks and others may use new
approaches. The paradigm also gives researchers
a common language to communicate with other
fields (such as decision theory and economics)
that also use concepts of abstract agents. The
intelligent agent paradigm became widely accepted during
the 1990s. [2]

Agent architectures and cognitive architectures 

Researchers have designed systems to build
intelligent systems out of interacting intelligent agents
in a multi-agent system. [116] A system with both
symbolic and sub-symbolic components is a hybrid
intelligent system, and the study of such systems
is artificial intelligence systems integration. A
hierarchical control system provides a bridge
between sub-symbolic AI at its lowest, reactive levels
and traditional symbolic AI at its highest levels,
where relaxed time constraints permit planning and
world modelling. [117] Rodney Brooks' subsumption
architecture was an early proposal for such a
hierarchical system. [118]

2.2.3 Tools 

In the course of 50 years of research, AI has developed a 
large number of tools to solve the most difficult problems 
in computer science. A few of the most general of these 
methods are discussed below. 


Search and optimization 

Main articles: Search algorithm, Mathematical
optimization and Evolutionary computation

Many problems in AI can be solved in theory by
intelligently searching through many possible solutions: [119]
Reasoning can be reduced to performing a search. For
example, logical proof can be viewed as searching for a
path that leads from premises to conclusions, where each
step is the application of an inference rule. [120] Planning
algorithms search through trees of goals and subgoals,
attempting to find a path to a target goal, a process
called means-ends analysis. [121] Robotics algorithms for
moving limbs and grasping objects use local searches
in configuration space. [79] Many learning algorithms use
search algorithms based on optimization.

Simple exhaustive searches [122] are rarely sufficient for
most real world problems: the search space (the
number of places to search) quickly grows to astronomical
numbers. The result is a search that is too slow or never
completes. The solution, for many problems, is to use
"heuristics" or “rules of thumb” that eliminate choices
that are unlikely to lead to the goal (called "pruning the
search tree"). Heuristics supply the program with a “best
guess” for the path on which the solution lies. [123]
Heuristics limit the search for solutions into a smaller sample
size. [80]
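
To make the role of a heuristic concrete, here is a minimal sketch of A* search on a small grid. A* is one well-known heuristic search method; the text above does not name a specific algorithm, so its use here is an assumption, as are the function name a_star and the encoding of walls as '#'. The Manhattan-distance heuristic supplies the "best guess" that steers the search toward the goal, so that unpromising cells are expanded late or not at all.

```python
import heapq


def a_star(grid, start, goal):
    """Return the length of a shortest 4-connected path, or None if unreachable.

    grid is a list of equal-length strings; '#' marks a wall (hypothetical encoding).
    """
    def heuristic(cell):
        # Manhattan distance: never overestimates on a 4-connected grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(heuristic(start), 0, start)]  # (estimated total, cost so far, cell)
    best_cost = {start: 0}
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            return cost
        if cost > best_cost[cell]:
            continue  # stale queue entry: a cheaper route was already found
        row, col = cell
        for nxt in ((row + 1, col), (row - 1, col), (row, col + 1), (row, col - 1)):
            r, c = nxt
            if not (0 <= r < len(grid) and 0 <= c < len(grid[0])):
                continue  # outside the grid
            if grid[r][c] == '#':
                continue  # wall
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt))
    return None


grid = ["....",
        ".##.",
        "...."]
steps = a_star(grid, (0, 0), (2, 3))
```

With a heuristic that always returns zero, the same code degenerates into an uninformed search that explores cells in order of distance from the start, which is the behaviour the heuristic is meant to improve on.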

A very different kind of search came to prominence in the
1990s, based on the mathematical theory of optimization.
For many problems, it is possible to begin the search with
some form of a guess and then refine the guess
incrementally until no more refinements can be made. These
algorithms can be visualized as blind hill climbing: we begin
the search at a random point on the landscape, and then,
by jumps or steps, we keep moving our guess uphill,
until we reach the top. Other optimization algorithms are
simulated annealing, beam search and random
optimization. [124]
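
The hill-climbing picture can be sketched in a few lines. This is an illustrative toy, not a reference implementation: the function name hill_climb, the integer search space and the single-peaked landscape are all assumptions, and the starting point is fixed rather than random so that the run is repeatable.

```python
def hill_climb(score, start, step=1, max_iters=1000):
    """Greedy local search over the integers: move to the better neighbour
    until neither neighbour improves the score."""
    current = start
    for _ in range(max_iters):
        best = max((current - step, current + step), key=score)
        if score(best) <= score(current):
            return current  # local maximum: no uphill move is left
        current = best
    return current


def landscape(x):
    """Hypothetical landscape with a single peak at x = 7."""
    return -(x - 7) ** 2


peak = hill_climb(landscape, start=0)
```

On a landscape with several peaks this procedure stops at whichever local maximum it reaches first, which is one motivation for alternatives such as simulated annealing.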

Evolutionary computation uses a form of optimization
search. For example, an evolutionary algorithm may begin with a
population of organisms (the guesses) and then allow them to mutate
and recombine, selecting only the fittest to survive each
generation (refining the guesses). Forms of evolutionary
computation include swarm intelligence algorithms (such
as ant colony or particle swarm optimization) [125] and
evolutionary algorithms (such as genetic algorithms, gene
expression programming, and genetic programming). [126]
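
The mutate, recombine and select cycle might look like the following toy genetic algorithm over bit strings. Everything specific here is an assumption chosen for illustration: the function name evolve, the population size, one-point crossover, single-bit mutation, keeping the fittest half unchanged, and the "count the 1 bits" fitness function.

```python
import random


def evolve(fitness, length=12, pop_size=20, generations=40, seed=0):
    """Evolve bit strings; return the best fitness seen in each generation."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        survivors = population[: pop_size // 2]   # selection: keep the fittest half
        children = []
        while len(survivors) + len(children) < pop_size:
            mum, dad = rng.sample(survivors, 2)
            cut = rng.randrange(1, length)         # recombination: one-point crossover
            child = mum[:cut] + dad[cut:]
            flip = rng.randrange(length)           # mutation: flip one random bit
            child[flip] = 1 - child[flip]
            children.append(child)
        population = survivors + children
    return history


# Toy fitness: the number of 1 bits in the string.
history = evolve(sum)
```

Because the survivors are carried over unchanged, the best fitness recorded in history can never decrease from one generation to the next.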

Logic 

Main articles: Logic programming and Automated 
reasoning 

Logic [127] is used for knowledge representation and
problem solving, but it can be applied to other problems
as well. For example, the satplan algorithm uses logic
for planning [128] and inductive logic programming is a
method for learning. [129]

Several different forms of logic are used in AI research.
Propositional or sentential logic [130] is the logic of
statements which can be true or false. First-order logic [131]
also allows the use of quantifiers and predicates, and can
express facts about objects, their properties, and their
relations with each other. Fuzzy logic [132] is a version
of first-order logic which allows the truth of a statement
to be represented as a value between 0 and 1, rather
than simply True (1) or False (0). Fuzzy systems can be
used for uncertain reasoning and have been widely used
in modern industrial and consumer product control
systems. Subjective logic [133] models uncertainty in a
different and more explicit manner than fuzzy logic: a given
binomial opinion satisfies belief + disbelief + uncertainty
= 1 within a Beta distribution. By this method, ignorance
can be distinguished from probabilistic statements that an
agent makes with high confidence.
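
A brief sketch may help separate the two ideas. The fuzzy connectives below use the min/max/complement definitions, which are one common choice and an assumption here, since the text does not fix the operators. The helper is_binomial_opinion only checks the constraint stated above (belief + disbelief + uncertainty = 1); it does not implement the rest of subjective logic. All names and numbers are hypothetical.

```python
def fuzzy_and(a, b):
    """Conjunction of two truth degrees (min operator, an assumed choice)."""
    return min(a, b)


def fuzzy_or(a, b):
    """Disjunction of two truth degrees (max operator, an assumed choice)."""
    return max(a, b)


def fuzzy_not(a):
    """Negation of a truth degree."""
    return 1.0 - a


def is_binomial_opinion(belief, disbelief, uncertainty, tol=1e-9):
    """Check that the three masses lie in [0, 1] and sum to 1."""
    parts = (belief, disbelief, uncertainty)
    return all(0.0 <= p <= 1.0 for p in parts) and abs(sum(parts) - 1.0) <= tol


# Hypothetical truth degrees for "the room is warm" and "the room is humid".
warm, humid = 0.7, 0.4
muggy = fuzzy_and(warm, humid)

# Total ignorance versus a confident 50/50 statement: both are valid opinions,
# but they differ in how much mass sits on uncertainty.
ignorant = (0.0, 0.0, 1.0)
confident_even = (0.5, 0.5, 0.0)
```

The last two tuples illustrate the distinction drawn in the text: a single fuzzy or probabilistic value of 0.5 cannot tell them apart, while the explicit uncertainty component can.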

Default logics, non-monotonic logics and
circumscription [52] are forms of logic designed to
help with default reasoning and the qualification
problem. Several extensions of logic have been designed
to handle specific domains of knowledge, such as:
description logics; [46] situation calculus, event calculus
and fluent calculus (for representing events and time); [47]
causal calculus; [48] belief calculus; and modal logics. [49]

Probabilistic methods for uncertain reasoning 

Main articles: Bayesian network, Hidden Markov model,
Kalman filter, Decision theory and Utility theory

Many problems in AI (in reasoning, planning,
learning, perception and robotics) require the agent to
operate with incomplete or uncertain information. AI
researchers have devised a number of powerful tools to
solve these problems using methods from probability
theory and economics. [134]

Bayesian networks [135] are a very general tool that can
be used for a large number of problems: reasoning
(using the Bayesian inference algorithm), [136] learning (using
the expectation-maximization algorithm), [137] planning
(using decision networks) [138] and perception (using
dynamic Bayesian networks). [139] Probabilistic
algorithms can also be used for filtering, prediction,
smoothing and finding explanations for streams of data, helping
perception systems to analyze processes that occur over
time (e.g., hidden Markov models or Kalman filters). [139]
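
The simplest case of Bayesian inference is a two-node network: one hidden cause and one observed piece of evidence. The sketch below applies Bayes' rule to that case; the function name posterior and the fault-detection numbers are hypothetical, and real Bayesian networks involve many variables and more elaborate inference procedures.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """P(cause | evidence) by Bayes' rule for a binary cause."""
    joint_true = prior * likelihood_if_true
    joint_false = (1.0 - prior) * likelihood_if_false
    return joint_true / (joint_true + joint_false)


# Hypothetical numbers: 1% of parts are faulty; a sensor flags 90% of
# faulty parts and 5% of good ones.
p_faulty_given_flag = posterior(0.01, 0.90, 0.05)
```

With these assumed numbers the posterior is 0.009 / 0.0585, roughly 0.15: a flagged part is still more likely good than faulty, because faults are rare to begin with.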

A key concept from the science of economics is "utility":
a measure of how valuable something is to an
intelligent agent. Precise mathematical tools have been
developed that analyze how an agent can make choices and
plan, using decision theory, decision analysis, [140] and
information value theory. [58] These tools include models
such as Markov decision processes, [141] dynamic decision
networks, [139] game theory and mechanism design. [142]
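
A minimal sketch of choosing by expected utility, the basic calculation behind decision theory, is shown here. The function names, the umbrella scenario and all probabilities and utilities are assumptions made up for the example.

```python
def expected_utility(outcomes):
    """outcomes is a list of (probability, utility) pairs for one action."""
    return sum(p * u for p, u in outcomes)


def best_action(actions):
    """Return the name of the action whose expected utility is highest."""
    return max(actions, key=lambda name: expected_utility(actions[name]))


# Hypothetical choice with a 30% chance of rain.
actions = {
    "umbrella":    [(0.3, 5.0), (0.7, 8.0)],     # dry in rain, mild nuisance otherwise
    "no umbrella": [(0.3, -10.0), (0.7, 10.0)],  # soaked in rain, unencumbered otherwise
}
choice = best_action(actions)
```

Under these assumed numbers the umbrella has expected utility 7.1 against 4.0 without it, so the agent takes the umbrella even though rain is the less likely outcome.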

Classifiers and statistical learning methods 

Main articles: Classifier (mathematics), Statistical
classification and Machine learning

The simplest AI applications can be divided into two
types: classifiers (“if shiny then diamond”) and
controllers (“if shiny then pick up”). Controllers do,
however, also classify conditions before inferring actions, and
therefore classification forms a central part of many AI
systems. Classifiers are functions that use pattern
matching to determine a closest match. They can be tuned
according to examples, making them very attractive for use
in AI. These examples are known as observations or
patterns. In supervised learning, each pattern belongs to a
certain predefined class. A class can be seen as a
decision that has to be made. All the observations combined
with their class labels are known as a data set. When a
new observation is received, that observation is classified
based on previous experience. [143]
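
The idea of classifying a new observation by its closest matches in a labelled data set can be sketched with a small k-nearest-neighbour classifier. The function name knn_classify, the two features (shininess and hardness) and the data points are hypothetical, chosen to echo the "if shiny then diamond" example.

```python
from collections import Counter


def knn_classify(data_set, observation, k=3):
    """Label an observation by majority vote among its k nearest neighbours.

    data_set is a list of (features, class_label) pairs.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = sorted(
        data_set, key=lambda item: squared_distance(item[0], observation)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical data set: (shininess, hardness) -> class label.
data_set = [
    ((0.90, 0.95), "diamond"), ((0.80, 0.90), "diamond"), ((0.85, 0.99), "diamond"),
    ((0.70, 0.30), "glass"),   ((0.60, 0.25), "glass"),   ((0.75, 0.35), "glass"),
]
label = knn_classify(data_set, (0.82, 0.93))
```

Here "previous experience" is simply the stored data set; nothing is fitted in advance, and all the work happens when a new observation arrives.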

A classifier can be trained in various ways; there
are many statistical and machine learning approaches.
The most widely used classifiers are the neural
network, [144] kernel methods such as the support vector
machine, [145] k-nearest neighbor algorithm, [146] Gaussian
mixture model, [147] naive Bayes classifier, [148] and
decision tree. [149] The performance of these classifiers
has been compared over a wide range of tasks.
Classifier performance depends greatly on the characteristics
of the data to be classified. There is no single classifier 
that works best on all given problems; this is also referred 
to as the "no free lunch" theorem. Determining a
suitable classifier for a given problem is still more an art than
science. [150]

Neural networks

Main articles: Artificial neural network and Connectionism

A neural network is an interconnected group of nodes, akin to the
vast network of neurons in the human brain.

The study of artificial neural networks [144] began in the
decade before the field of AI research was founded, in
the work of Walter Pitts and Warren McCulloch. Other
important early researchers were Frank Rosenblatt, who
invented the perceptron, and Paul Werbos, who developed
the backpropagation algorithm. [151]

The main categories of networks are acyclic or
feedforward neural networks (where the signal passes in
only one direction) and recurrent neural networks (which
allow feedback). Among the most popular feedforward
networks are perceptrons, multi-layer perceptrons and
radial basis networks. [152] Among recurrent networks,
the most famous is the Hopfield net, a form of attractor
network, which was first described by John Hopfield in
1982. [153] Neural networks can be applied to the problem
of intelligent control (for robotics) or learning, using
such techniques as Hebbian learning and competitive
learning. [154]

Hierarchical temporal memory is an approach that
models some of the structural and algorithmic properties of
the neocortex. [155] The term "deep learning" gained
traction in the mid-2000s after a publication by Geoffrey
Hinton and Ruslan Salakhutdinov showed how a
many-layered feedforward neural network could be effectively
pre-trained one layer at a time, treating each layer in turn
as an unsupervised restricted Boltzmann machine, then
using supervised backpropagation for fine-tuning. [156]

Control theory

Main article: Intelligent control

Control theory, the grandchild of cybernetics, has many
important applications, especially in robotics. [157]

Languages

Main article: List of programming languages for artificial
intelligence

AI researchers have developed several specialized
languages for AI research, including Lisp [158] and
Prolog. [159]


2.2.4 Evaluating progress

Main article: Progress in artificial intelligence


In 1950, Alan Turing proposed a general procedure to 
test the intelligence of an agent now known as the Turing 
test. This procedure allows almost all the major problems 
of artificial intelligence to be tested. However, it is a very 
difficult challenge and at present all agents fail. [160] 

Artificial intelligence can also be evaluated on specific 
problems such as small problems in chemistry, handwriting 
recognition and game-playing. Such tests have 
been termed subject matter expert Turing tests. Smaller 
problems provide more achievable goals and there are an 
ever-increasing number of positive results. [161] 

One classification for outcomes of an AI test is: [162] 

1. Optimal: it is not possible to perform better. 

2. Strong super-human: performs better than all humans. 

3. Super-human: performs better than most humans. 

4. Sub-human: performs worse than most humans. 

For example, performance at draughts (i.e. checkers) is 
optimal, [163] performance at chess is super-human and 
nearing strong super-human (see computer chess: computers 
versus human) and performance at many everyday 
tasks (such as recognizing a face or crossing a room without 
bumping into something) is sub-human. 
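
The four-way classification above can be written down as a small function. This is only an illustrative sketch: the names `Outcome` and `classify_outcome`, the choice of inputs (the share of humans a system outperforms plus a flag for provable optimality), and the reading of "most humans" as "more than half" are assumptions made here, not part of the cited classification.

```python
from enum import Enum


class Outcome(Enum):
    OPTIMAL = "optimal"
    STRONG_SUPER_HUMAN = "strong super-human"
    SUPER_HUMAN = "super-human"
    SUB_HUMAN = "sub-human"


def classify_outcome(fraction_beaten: float, provably_optimal: bool = False) -> Outcome:
    """Map a performance summary to one of the four outcome classes.

    fraction_beaten: share of humans the system outperforms, in [0, 1].
    provably_optimal: True if it is not possible to perform better.
    """
    if not 0.0 <= fraction_beaten <= 1.0:
        raise ValueError("fraction_beaten must lie in [0, 1]")
    if provably_optimal:
        return Outcome.OPTIMAL
    if fraction_beaten == 1.0:
        return Outcome.STRONG_SUPER_HUMAN
    # Assumption: "most humans" is read as "more than half".
    if fraction_beaten > 0.5:
        return Outcome.SUPER_HUMAN
    # Everything else, including an exact tie at one half, is
    # treated as sub-human in this sketch.
    return Outcome.SUB_HUMAN
```

Under these assumptions, a solved game such as draughts would be passed in with `provably_optimal=True`, while a system that beats nearly but not quite all human players would fall into the super-human class.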

A quite different approach measures machine intelligence 
through tests which are developed from mathematical 
definitions of intelligence. Examples of such 
tests began to appear in the late 1990s, with intelligence 
tests devised using notions from Kolmogorov complexity and data 
compression. [164] Two major advantages of mathematical 
definitions are their applicability to nonhuman 
intelligences and their absence of a requirement for human 
testers. 
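
To give a feel for the link between complexity and compression, the sketch below uses compressed length as a rough stand-in for Kolmogorov complexity. Kolmogorov complexity itself is uncomputable, so the length of a compressed encoding can only serve as an upper-bound proxy. The choice of `zlib`, the helper names and the sample data are assumptions for illustration and are not the method used by the cited tests.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size; lower means more regular."""
    if not data:
        raise ValueError("data must be non-empty")
    return compressed_size(data) / len(data)


# A highly patterned sequence: 1000 bytes of repeated "ab".
regular = b"ab" * 500

# A pseudo-random sequence of the same length (fixed seed for repeatability).
rng = random.Random(0)
noisy = bytes(rng.randrange(256) for _ in range(1000))

# The patterned sequence compresses far better than the pseudo-random one.
print(compression_ratio(regular) < compression_ratio(noisy))
```

A sequence with a short description compresses well, while one with no exploitable regularity barely compresses at all; tests built on these notions ask an agent to find and exploit such regularities.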

A derivative of the Turing test is the Completely Automated 
Public Turing test to tell Computers and Humans 
Apart (CAPTCHA). As the name implies, this helps to 
determine that a user is an actual person and not a computer 
posing as a human. In contrast to the standard Turing 
test, a CAPTCHA is administered by a machine and targeted 
to a human, as opposed to being administered by 
a human and targeted to a machine. A computer asks a 
user to complete a simple test, then generates a grade for 
that test. Computers are unable to solve the problem, so 
correct solutions are deemed to be the result of a person 
taking the test. A common type of CAPTCHA is the 
test that requires the typing of distorted letters, numbers 
or symbols that appear in an image undecipherable by a 
computer. [163] 
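
The machine-administered flow described above (issue a test, then grade the answer) can be sketched as follows. This is a minimal illustration, not a usable CAPTCHA: rendering the challenge as a distorted image, which is what makes the test hard for a computer, is omitted, and the names `ALPHABET`, `make_challenge` and `grade_response` are hypothetical.

```python
import secrets
import string

# Characters a challenge may contain (an assumption for this sketch).
ALPHABET = string.ascii_uppercase + string.digits


def make_challenge(length: int = 6) -> str:
    """Generate the random text a user would be asked to type."""
    if length < 1:
        raise ValueError("length must be positive")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def grade_response(challenge: str, response: str) -> bool:
    """Return True if the typed response matches the challenge.

    The comparison ignores case and surrounding whitespace.
    """
    expected = challenge.upper().encode()
    typed = response.strip().upper().encode()
    return secrets.compare_digest(expected, typed)


challenge = make_challenge()
# In a real system the challenge would be shown as a distorted image;
# here the grader is simply exercised with a matching answer.
print(grade_response(challenge, challenge.lower()))
```

The grading step is deliberately trivial; the difficulty of the test lies entirely in how the challenge is presented to the user.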

2.3 Applications 



An automated online assistant providing customer service on a 
web page - one of many very primitive applications of artificial 
intelligence. 

Main article: Applications of artificial intelligence 

Artificial intelligence techniques are pervasive and are too 
numerous to list. Frequently, when a technique reaches 
mainstream use, it is no longer considered artificial 
intelligence; this phenomenon is described as the AI effect. [166] 
An area that artificial intelligence has contributed greatly 
to is intrusion detection. [167] 

2.3.1 Competitions and prizes 

Main article: Competitions and prizes in artificial 
intelligence 

There are a number of competitions and prizes to promote 
research in artificial intelligence. The main areas 
promoted are: general machine intelligence, conversational 
behavior, data-mining, robotic cars, robot soccer 
and games. 

2.3.2 Platforms 

A platform (or "computing platform") is defined as “some 
sort of hardware architecture or software framework 
(including application frameworks), that allows software to 
run.” As Rodney Brooks pointed out many years ago, [168] 
it is not just the artificial intelligence software that defines 
the AI features of the platform, but rather the actual 
platform itself that affects the AI that results, i.e., there needs 
to be work in AI problems on real-world platforms rather 
than in isolation. 

A wide variety of platforms has allowed different aspects 
of AI to develop, ranging from expert systems (PC-based, 
but still entire real-world systems) to various 
robot platforms such as the widely available Roomba 
with its open interface. [169] 

2.3.3 Toys 

AIBO, the first robotic pet, grew out of Sony’s Computer 
Science Laboratory (CSL). Famed engineer Toshitada 
Doi is credited as AIBO’s original progenitor: in 1994 
he had started work on robots with artificial intelligence 
expert Masahiro Fujita, at CSL. Doi’s friend, the artist 
Hajime Sorayama, was enlisted to create the initial designs 
for the AIBO’s body. Those designs are now part of 
the permanent collections of the Museum of Modern Art and 
the Smithsonian Institution, with later versions of AIBO 
being used in studies at Carnegie Mellon University. In 
2006, AIBO was inducted into Carnegie Mellon University’s 
“Robot Hall of Fame”. 


2.4 Philosophy and ethics 

Main articles: Philosophy of artificial intelligence and 
Ethics of artificial intelligence 

Alan Turing wrote in 1950 “I propose to consider the 
question 'can a machine think'?” [160] and began the 
discussion that has become the philosophy of artificial 
intelligence. Because “thinking” is difficult to define, there 
are two versions of the question that philosophers have 
addressed. First, can a machine be intelligent? I.e., can 
it solve all the problems that humans solve by using 
intelligence? And second, can a machine be built with a mind 
and the experience of subjective consciousness? [170] 

The existence of an artificial intelligence that rivals or 
exceeds human intelligence raises difficult ethical issues, 
both on behalf of humans and on behalf of any possible 
sentient AI. The potential power of the technology 
inspires both hopes and fears for society. 

2.4.1 The possibility/impossibility of artificial 
general intelligence 

Main articles: Philosophy of AI, Turing test, Physical 
symbol systems hypothesis, Dreyfus’ critique of AI, The 
Emperor’s New Mind and AI effect 

Can a machine be intelligent? Can it “think”? 

Turing’s “polite convention”: We need not decide if a 
machine can “think”; we need only decide if a machine 
can act as intelligently as a human being. This 
approach to the philosophical problems associated 
with artificial intelligence forms the basis of the 
Turing test. [160] 

The Dartmouth proposal: “Every aspect of learning or 
any other feature of intelligence can be so precisely 
described that a machine can be made to simulate 
it.” This conjecture was printed in the proposal for 
the Dartmouth Conference of 1956, and represents 
the position of most working AI researchers. [171] 

Newell and Simon’s physical symbol system hypothesis: 
“A physical symbol system has the necessary and 
sufficient means of general intelligent action.” 
Newell and Simon argue that intelligence consists 
of formal operations on symbols. [172] Hubert 
Dreyfus argued that, on the contrary, human expertise 
depends on unconscious instinct rather than 
conscious symbol manipulation and on having a 
“feel” for the situation rather than explicit symbolic 
knowledge. (See Dreyfus’ critique of AI.) [173][174] 

Gödelian arguments: Gödel himself, [175] John Lucas (in 
1961) and Roger Penrose (in a more detailed argument 
from 1989 onwards) argued that humans are 
not reducible to Turing machines. [176] The detailed 
arguments are complex, but in essence they derive 
from Kurt Gödel's 1931 proof in his first incompleteness 
theorem that it is always possible to create 
statements that a formal system could not prove. A 
human being, however, can (with some thought) see 
the truth of these “Gödel statements”. Any Turing 
program designed to search for these statements can 
have its methods reduced to a formal system, and so 
will always have a “Gödel statement” derivable from 
its program which it can never discover. However, if 
humans are indeed capable of understanding mathematical 
truth, it doesn't seem possible that we could 
be limited in the same way. This is quite a general 
result, if accepted, since it can be shown that hardware 
neural nets, computers based on random 
processes (e.g. annealing approaches) and quantum 
computers based on entangled qubits (so long as they 
involve no new physics) can all be reduced to Turing 
machines. All they do is reduce the complexity 
of the tasks, not permit new types of problems 
to be solved. Roger Penrose speculates that there 
may be new physics involved in our brain, perhaps 
at the intersection of gravity and quantum mechanics 
at the Planck scale. This argument, if accepted, 
does not rule out the possibility of true artificial 
intelligence, but means it has to be biological in basis 
or based on new physical principles. The argument 
has been followed up by many counter-arguments, 
to which Roger Penrose has replied with 
counter-counter-examples, and it is now an intricate, 
complex debate. [177] For details see Philosophy of 
artificial intelligence: Lucas, Penrose and Gödel. 

The artificial brain argument: The brain can be simulated 
by machines and because brains are intelligent, 
simulated brains must also be intelligent; thus 
machines can be intelligent. Hans Moravec, Ray 
Kurzweil and others have argued that it is technologically 
feasible to copy the brain directly into hardware 
and software, and that such a simulation will be 
essentially identical to the original. [91] 

The AI effect: Machines are already intelligent, but observers 
have failed to recognize it. When Deep Blue 
beat Garry Kasparov in chess, the machine was acting 
intelligently. However, onlookers commonly discount 
the behavior of an artificial intelligence program 
by arguing that it is not “real” intelligence after 
all; thus “real” intelligence is whatever intelligent 
behavior people can do that machines still cannot. 
This is known as the AI Effect: “AI is whatever 
hasn't been done yet.” 

2.4.2 Intelligent behaviour and machine ethics 

As a minimum, an AI system must be able to reproduce 
aspects of human intelligence. This raises the issue of 
how ethically the machine should behave towards both 
humans and other AI agents. This issue was addressed 
by Wendell Wallach in his book titled Moral Machines 
in which he introduced the concept of artificial moral 
agents (AMA). [178] For Wallach, AMAs have become a 
part of the research landscape of artificial intelligence 
as guided by its two central questions which he identifies 
as “Does Humanity Want Computers Making Moral 
Decisions” [179] and “Can (Ro)bots Really Be Moral”. [180] 
For Wallach, the question centers not on whether 
machines can demonstrate the equivalent of 
moral behavior, but on the constraints which society 
may place on the development of AMAs. [181] 

Machine ethics 

Main article: Machine ethics 

The field of machine ethics is concerned with giving machines 
ethical principles, or a procedure for discovering a 
way to resolve the ethical dilemmas they might encounter, 
enabling them to function in an ethically responsible manner 
through their own ethical decision making. [182] The 
field was delineated in the AAAI Fall 2005 Symposium 
on Machine Ethics: “Past research concerning the relationship 
between technology and ethics has largely focused 
on responsible and irresponsible use of technology 
by human beings, with a few people being interested 
in how human beings ought to treat machines. In 
all cases, only human beings have engaged in ethical reasoning. 
The time has come for adding an ethical dimension 
to at least some machines. Recognition of the ethical 
ramifications of behavior involving machines, as well 
as recent and potential developments in machine autonomy, 
necessitate this. In contrast to computer hacking, 
software property issues, privacy issues and other topics 
normally ascribed to computer ethics, machine ethics is 
concerned with the behavior of machines towards human 
users and other machines. Research in machine ethics 
is key to alleviating concerns with autonomous systems 
— it could be argued that the notion of autonomous machines 
without such a dimension is at the root of all fear 
concerning machine intelligence. Further, investigation 
of machine ethics could enable the discovery of problems 
with current ethical theories, advancing our thinking 
about Ethics.” [183] Machine ethics is sometimes referred 
to as machine morality, computational ethics or 
computational morality. A variety of perspectives of this 
nascent field can be found in the collected edition “Machine 
Ethics” [182] that stems from the AAAI Fall 2005 
Symposium on Machine Ethics. [183] 

Malevolent and friendly AI 

Main article: Friendly AI 

Political scientist Charles T. Rubin believes that AI can be 
neither designed nor guaranteed to be benevolent. [184] He 
argues that “any sufficiently advanced benevolence may 
be indistinguishable from malevolence.” Humans should 
not assume machines or robots would treat us favorably, 
because there is no a priori reason to believe that they 
would be sympathetic to our system of morality, which 
has evolved along with our particular biology (which AIs 
would not share). Hyper-intelligent software may not 
necessarily decide to support the continued existence of 
mankind, and would be extremely difficult to stop. This 
topic has also recently begun to be discussed in academic 
publications as a real source of risks to civilization, humans, 
and planet Earth. 

Physicist Stephen Hawking, Microsoft founder Bill Gates 
and SpaceX founder Elon Musk have expressed concerns 
about the possibility that AI could evolve to the point that 
humans could not control it, with Hawking theorizing that 
this could "spell the end of the human race". [185] 

One proposal to deal with this is to ensure that the first 
generally intelligent AI is 'Friendly AI', and will then be 
able to control subsequently developed AIs. Some question 
whether this kind of check could really remain in 
place. 

Leading AI researcher Rodney Brooks writes, “I think it 
is a mistake to be worrying about us developing malevolent 
AI anytime in the next few hundred years. I think the 
worry stems from a fundamental error in not distinguishing 
the difference between the very real recent advances 
in a particular aspect of AI, and the enormity and complexity 
of building sentient volitional intelligence.” [186] 

Devaluation of humanity 

Main article: Computer Power and Human Reason 

Joseph Weizenbaum wrote that AI applications cannot, 
by definition, successfully simulate genuine human empathy 
and that the use of AI technology in fields such as 
customer service or psychotherapy [187] was deeply misguided. 
Weizenbaum was also bothered that AI researchers 
(and some philosophers) were willing to view 
the human mind as nothing more than a computer program 
(a position now known as computationalism). To 
Weizenbaum these points suggest that AI research devalues 
human life. [188] 

Decrease in demand for human labor 

Martin Ford, author of The Lights in the Tunnel: Automation, 
Accelerating Technology and the Economy of the 
Future, [189] and others argue that specialized artificial 
intelligence applications, robotics and other forms of automation 
will ultimately result in significant unemployment 
as machines begin to match and exceed the capability 
of workers to perform most routine and repetitive jobs. 
Ford predicts that many knowledge-based occupations, 
and in particular entry-level jobs, will be increasingly 
susceptible to automation via expert systems, machine 
learning [190] and other AI-enhanced applications. 
AI-based applications may also be used to amplify the 
capabilities of low-wage offshore workers, making it more 
feasible to outsource knowledge work. [191] 

2.4.3 Machine consciousness, sentience and mind 

Main article: Artificial consciousness 

If an AI system replicates all key aspects of human 
intelligence, will that system also be sentient - will it have 
a mind which has conscious experiences? This question 
is closely related to the philosophical problem as to the 
nature of human consciousness, generally referred to as 
the hard problem of consciousness. 

Consciousness 

Main articles: Hard problem of consciousness and 
Theory of mind 

There are no objective criteria for knowing whether an 
intelligent agent is sentient - that it has conscious 
experiences. We assume that other people do because we do 
and they tell us that they do, but this is only a subjective 
determination. The lack of any hard criteria is known as 
the “hard problem” in the theory of consciousness. The 
problem applies not only to other people but to the higher 
animals and, by extension, to AI agents. 

Computationalism 

Main articles: Computationalism and Functionalism 
(philosophy of mind) 

Are human intelligence, consciousness and mind products 
of information processing? Is the brain essentially a 
computer? 

Computationalism is the idea that “the human mind or 
the human brain (or both) is an information processing 
system and that thinking is a form of computing”. AI, 
or implementing machines with human intelligence, was 
founded on the claim that “a central property of humans, 
intelligence can be so precisely described that a machine 
can be made to simulate it”. On this view, a program can 
be derived from the human computer and implemented 
in an artificial one to create efficient artificial 
intelligence. Such a program would act upon a set of outputs 
that result from set inputs of the computer's internal 
memory; that is, the machine can only act with what 
has been implemented in it to start with. A long-term goal for 
AI researchers is to provide machines with a deep 
understanding of the many abilities of a human being in order 
to replicate a general intelligence, or strong AI, defined as a 
machine surpassing human abilities to perform the skills 
implanted in it. This is a scary thought to many, who fear 
losing control of such a powerful machine. 

Obstacles for researchers are mainly time constraints: AI 
scientists cannot establish much of a database for commonsense 
knowledge because it must be ontologically crafted 
into the machine, which takes a tremendous amount of 
time. To combat this, AI research looks to have the machine 
understand enough concepts to add to its own ontology. 
It is unclear how it can do this, however, when machine 
ethics is primarily concerned with the behavior of machines 
towards humans or other machines, limiting the extent of 
developing AI. In order to function like a common human, 
AI must also display “the ability to solve subsymbolic 
commonsense knowledge tasks such as how artists can 
tell statues are fake or how chess masters don’t move certain 
spots to avoid exposure,” but by developing machines 
that can do it all, AI research faces the difficulty of 
potentially putting a lot of people out of work, while on 
the economic side businesses would boom from 
efficiency, thus forcing AI into a bottleneck in trying to 
develop self-improving machines. 


Strong AI hypothesis 

Main article: Chinese room 

Searle’s strong AI hypothesis states that “The appropriately 
programmed computer with the right inputs and outputs 
would thereby have a mind in exactly the same sense 
human beings have minds.” [192] John Searle counters this 
assertion with his Chinese room argument, which asks 
us to look inside the computer and try to find where the 
“mind” might be. [193] 


Robot rights 

Main article: Robot rights 

Mary Shelley's Frankenstein considers a key issue in the 
ethics of artificial intelligence: if a machine can be created 
that has intelligence, could it also feel? If it can feel, 
does it have the same rights as a human? The idea also 
appears in modern science fiction, such as the film A.I. 
Artificial Intelligence, in which humanoid machines have 
the ability to feel emotions. This issue, now known as 
"robot rights", is currently being considered by, for example, 
California’s Institute for the Future, although many 
critics believe that the discussion is premature. [194] The 
subject is discussed in depth in the 2010 documentary 
film Plug & Pray. [195] 

2.4.4 Superintelligence 

Main article: Superintelligence 

Are there limits to how intelligent machines - or 
human-machine hybrids - can be? A superintelligence, 
hyperintelligence, or superhuman intelligence is a hypothetical 
agent that would possess intelligence far surpassing that 
of the brightest and most gifted human mind. 
“Superintelligence” may also refer to the form or degree of 
intelligence possessed by such an agent. 

Technological singularity 

Main articles: Technological singularity and Moore’s law 

If research into Strong AI produced sufficiently intelligent 
software, it might be able to reprogram and improve 
itself. The improved software would be even 
better at improving itself, leading to recursive 
self-improvement. [196] The new intelligence could thus 
increase exponentially and dramatically surpass humans. 
Science fiction writer Vernor Vinge named this scenario 
"singularity". [197] The technological singularity is the point 
at which accelerating progress in technologies would cause a 
runaway effect wherein artificial intelligence exceeds human 
intellectual capacity and control, thus radically changing 
or even ending civilization. Because the capabilities of 
such an intelligence may be impossible to comprehend, 
the technological singularity is an occurrence beyond 
which events are unpredictable or even unfathomable. [197] 

Ray Kurzweil has used Moore’s law (which describes the 
relentless exponential improvement in digital technology) 
to calculate that desktop computers will have the same 
processing power as human brains by the year 2029, and 
predicts that the singularity will occur in 2045. [197] 

Transhumanism 

Main article: Transhumanism 

Robot designer Hans Moravec, cyberneticist Kevin Warwick 
and inventor Ray Kurzweil have predicted that humans 
and machines will merge in the future into cyborgs 
that are more capable and powerful than either. [198] This 
idea, called transhumanism, which has roots in Aldous 
Huxley and Robert Ettinger, has been illustrated in fiction 
as well, for example in the manga Ghost in the Shell 
and the science-fiction series Dune. 

In the 1980s, artist Hajime Sorayama's Sexy Robots series 
was painted and published in Japan, depicting the actual 
organic human form with lifelike muscular metallic skins. 
His later book “the Gynoids” followed and was used by 
or influenced movie makers, including George Lucas, and 
other creatives. Sorayama never considered these organic 
robots to be a real part of nature but always an unnatural 
product of the human mind, a fantasy existing in the mind even 
when realized in actual form. 

Edward Fredkin argues that “artificial intelligence is the 
next stage in evolution”, an idea first proposed by Samuel 
Butler's "Darwin among the Machines" (1863), and expanded 
upon by George Dyson in his book of the same 
name in 1998. [199] 


2.5 In fiction 

Main article: Artificial intelligence in fiction 

The implications of artificial intelligence have been a persistent 
theme in science fiction. Early stories typically revolved 
around intelligent robots. The word “robot” itself 
was coined by Karel Čapek in his 1921 play R.U.R., the 
title standing for "Rossum’s Universal Robots". Later, 
the SF writer Isaac Asimov developed the three laws of 
robotics which he subsequently explored in a long series 
of robot stories. These laws have since gained some traction 
in genuine AI research. 

Other influential fictional intelligences include HAL, the 
computer in charge of the spaceship in 2001: A Space 
Odyssey, released as both a film and a book in 1968 and 
written by Arthur C. Clarke. 

Since then, AI has become firmly rooted in popular culture. 


2.6 See also 

Main article: Outline of artificial intelligence 


• AI takeover 

• Artificial Intelligence (journal) 

• Artificial intelligence (video games) 

• Artificial stupidity 

• Nick Bostrom 

• Computer Go 

• Effective altruism 

• Existential risk 

• Existential risk of artificial general intelligence 

• Future of Humanity Institute 

• Human Cognome Project 

• List of artificial intelligence projects 

• List of artificial intelligence researchers 

• List of emerging technologies 

• List of important artificial intelligence publications 

• List of machine learning algorithms 

• List of scientific journals 

• Machine ethics 

• Machine learning 

• Never-Ending Language Learning 

• Our Final Invention 

• Outline of artificial intelligence 

• Outline of human intelligence 

• Philosophy of mind 

• Simulated reality 

• Superintelligence 

2.7 Notes 

[1] Definition of AI as the study of intelligent agents: 

• Poole, Mackworth & Goebel 1998, p. 1, which provides 
the version that is used in this article. Note 
that they use the term “computational intelligence” 
as a synonym for artificial intelligence. 

• Russell & Norvig (2003) (who prefer the term “rational 
agent”) and write “The whole-agent view is 
now widely accepted in the field” (Russell & Norvig 
2003, p. 55). 

• Nilsson 1998 

• Legg & Hutter 2007. 

[2] The intelligent agent paradigm: 

• Russell & Norvig 2003, pp. 27, 32-58, 968-972 

• Poole, Mackworth & Goebel 1998, pp. 7-21 

• Luger & Stubblefield 2004, pp. 235-240 

• Hutter 2005, pp. 125-126 

The definition used in this article, in terms of goals, actions, 
perception and environment, is due to Russell & 
Norvig (2003). Other definitions also include knowledge 
and learning as additional criteria. 

[3] Although there is some controversy on this point (see 
Crevier (1993, p. 50)), McCarthy states unequivocally “I 
came up with the term” in a CNET interview. (Skillings 
2006) McCarthy first used the term in the proposal 
for the Dartmouth conference, which appeared in 1955. 
(McCarthy et al. 1955) 

[4] McCarthy's definition of AI: 


• McCarthy 2007 

[5] Pamela McCorduck (2004, p. 424) writes of “the rough 
shattering of AI in subfields—vision, natural language, decision 
theory, genetic algorithms, robotics ... and these 
with own sub-subfield—that would hardly have anything 
to say to each other.” 

[6] This list of intelligent traits is based on the topics covered 
by the major AI textbooks, including: 

• Russell & Norvig 2003 

• Luger & Stubblefield 2004 

• Poole, Mackworth & Goebel 1998 

• Nilsson 1998 

[7] General intelligence (strong AI) is discussed in popular 
introductions to AI: 

• Kurzweil 1999 and Kurzweil 2005 

[8] See the Dartmouth proposal, under Philosophy, below. 

[9] This is a central idea of Pamela McCorduck's Machines 
Who Think. She writes: “I like to think of artificial intelligence 
as the scientific apotheosis of a venerable cultural 
tradition.” (McCorduck 2004, p. 34) “Artificial intelligence 
in one form or another is an idea that has pervaded 
Western intellectual history, a dream in urgent need of 
being realized.” (McCorduck 2004, p. xviii) “Our history 
is full of attempts—nutty, eerie, comical, earnest, 
legendary and real—to make artificial intelligences, to reproduce 
what is the essential us—bypassing the ordinary 
means. Back and forth between myth and reality, our 
imaginations supplying what our workshops couldn't, we 
have engaged for a long time in this odd form of 
self-reproduction.” (McCorduck 2004, p. 3) She traces the 
desire back to its Hellenistic roots and calls it the urge to 
“forge the Gods.” (McCorduck 2004, pp. 340-400) 

[10] The optimism referred to includes the predictions of early 
AI researchers (see optimism in the history of AI) as 
well as the ideas of modern transhumanists such as Ray 
Kurzweil. 

[11] The “setbacks” referred to include the ALPAC report 
of 1966, the abandonment of perceptrons in 1970, the 
Lighthill Report of 1973 and the collapse of the Lisp machine 
market in 1987. 

[12] AI applications widely used behind the scenes: 

• Russell & Norvig 2003, p. 28 

• Kurzweil 2005, p. 265 

• NRC 1999, pp. 216-222 

[13] AI in myth: 

• McCorduck 2004, pp. 4-5 

• Russell & Norvig 2003, p. 939 

[14] Cult images as artificial intelligence: 

• Crevier (1993, p. 1) (statue of Amun) 

• McCorduck (2004, pp. 6-9) 

These were the first machines to be believed to have true 
intelligence and consciousness. Hermes Trismegistus expressed 
the common belief that with these statues, craftsman 
had reproduced “the true nature of the gods”, their 
sensus and spiritus. McCorduck makes the connection 
between sacred automatons and Mosaic law (developed 
around the same time), which expressly forbids the worship 
of robots (McCorduck 2004, pp. 6-9) 

[15] Humanoid automata: 

Yan Shi: 

• Needham 1986, p. 53 
Hero of Alexandria: 

• McCorduck 2004, p. 6 
Al-Jazari: 

• “A Thirteenth Century Programmable Robot”. 
Shef.ac.uk. Retrieved 25 April 2009. 

Wolfgang von Kempelen: 

• McCorduck 2004, p. 17 

[16] Artificial beings: 

Jabir ibn Hayyan's Takwin: 

• O'Connor 1994 
Judah Loew's Golem: 

• McCorduck 2004, pp. 15-16 

• Buchanan 2005, p. 50 

Paracelsus’ Homunculus: 

• McCorduck 2004, pp. 13-14 

[17] AI in early science fiction: 

• McCorduck 2004, pp. 17-25 

[18] This insight, that digital computers can simulate any process 
of formal reasoning, is known as the Church-Turing 
thesis. 

[19] Formal reasoning: 

• Berlinski, David (2000). The Advent of the Algorithm. 
Harcourt Books. ISBN 0-15-601391-6. 
OCLC 46890682. 

[20] AI's immediate precursors: 

• McCorduck 2004, pp. 51-107 

• Crevier 1993, pp. 27-32 

• Russell & Norvig 2003, pp. 15, 940 

• Moravec 1988, p. 3 

See also Cybernetics and early neural networks (in History 
of artificial intelligence). Among the researchers who laid 
the foundations of AI were Alan Turing, John von Neumann, 
Norbert Wiener, Claude Shannon, Warren McCulloch, 
Walter Pitts and Donald Hebb. 


[21] Dartmouth conference: 

• McCorduck 2004, pp. 111-136 

• Crevier 1993, pp. 47-49, who writes “the conference 
is generally recognized as the official birthdate 
of the new science.” 

• Russell & Norvig 2003, p. 17, who call the conference 
“the birth of artificial intelligence.” 

• NRC 1999, pp. 200-201 

[22] Hegemony of the Dartmouth conference attendees: 

• Russell & Norvig 2003, p. 17, who write “for the 
next 20 years the field would be dominated by these 
people and their students.” 

• McCorduck 2004, pp. 129-130 

[23] Russell and Norvig write “it was astonishing whenever 
a computer did anything kind of smartish.” Russell & 
Norvig 2003, p. 18 

[24] "Golden years" of AI (successful symbolic reasoning programs 
1956-1973): 

• McCorduck 2004, pp. 243-252 

• Crevier 1993, pp. 52-107 

• Moravec 1988, p. 9 

• Russell & Norvig 2003, pp. 18-21 

The programs described are Arthur Samuel’s checkers 
program for the IBM 701, Daniel Bobrow’s STUDENT, 
Newell and Simon's Logic Theorist and Terry Winograd's 
SHRDLU. 

[25] DARPA pours money into undirected pure research into 
AI during the 1960s: 

• McCorduck 2004, p. 131 

• Crevier 1993, pp. 51, 64-65 

• NRC 1999, pp. 204-205 

[26] AI in England: 

• Howe 1994 

[27] Optimism of early AI: 

• Herbert Simon quote: Simon 1965, p. 96 quoted in 
Crevier 1993, p. 109. 

• Marvin Minsky quote: Minsky 1967, p. 2 quoted 
in Crevier 1993, p. 109. 

[28] See The problems (in History of artificial intelligence) 

[29] Lighthill 1973. 

[30] First AI Winter, Mansfield Amendment, Lighthill report: 

• Crevier 1993, pp. 115-117 

• Russell & Norvig 2003, p. 22 

• NRC 1999, pp. 212-213 

• Howe 1994 

[31] Expert systems: 

• ACM 1998, I.2.1 

• Russell & Norvig 2003, pp. 22-24 

• Luger & Stubblefield 2004. pp. 227-331 

• Nilsson 1998, chpt. 17.4 

• McCorduck 2004, pp. 327-335, 434-435 

• Crevier 1993, pp. 145-62, 197-203 

[32] Boom of the 1980s: rise of expert systems, Fifth Genera¬ 
tion Project, Alvey, MCC, SCI: 

• McCorduck 2004. pp. 426-441 

• Crevier 1993, pp. 161-162,197-203,211,240 

• Russell & Norvig 2003, p. 24 

• NRC 1999, pp. 210-211 

[33] Second AI winter: 

• McCorduck 2004. pp. 430-435 

• Crevier 1993, pp. 209-210 

• NRC 1999, pp. 214-216 

[34] Formal methods are now preferred (“Victory of the 
neats"): 

• Russell & Norvig 2003, pp. 25-26 

• McCorduck 2004. pp. 486-487 

[35] McCorduck 2004, pp. 480-483 

[36] Markoff 2011. 

[37] Administrator. “Kinect’s AI breakthrough explained”,
i-programmer.info.

[38] http://readwrite.com/2013/01/15/virtual-personal-assistants-the-future-of-your-smartphone-

[39] Lemmons, Phil (April 1985). “Artificial Intelligence”. 
BYTE. p. 125. Retrieved 14 February 2015. 

[40] Problem solving, puzzle solving, game playing and deduc¬ 
tion: 

• Russell & Norvig 2003, chpt. 3-9, 

• Poole, Mackworth & Goebel 1998, chpt. 2,3,7,9, 

• Luger & Stubblefield 2004, chpt. 3,4,6.8, 

• Nilsson 1998, chpt. 7-12 

[41] Uncertain reasoning: 

• Russell & Norvig 2003, pp. 452-644, 

• Poole, Mackworth & Goebel 1998. pp. 345-395, 

• Luger & Stubblefield 2004, pp. 333-381, 

• Nilsson 1998, chpt. 19 

[42] Intractability and efficiency and the combinatorial explo¬ 
sion: 

• Russell & Norvig 2003, pp. 9, 21-22 

[43] Psychological evidence of sub-symbolic reasoning: 


• Wason & Shapiro (1966) showed that people do 
poorly on completely abstract problems, but if the 
problem is restated to allow the use of intuitive 
social intelligence, performance dramatically im¬ 
proves. (See Wason selection task) 

• Kahneman, Slovic & Tversky (1982) have shown 
that people are terrible at elementary problems that 
involve uncertain reasoning. (See list of cognitive
biases for several examples). 

• Lakoff & Núñez (2000) have controversially argued
that even our skills at mathematics depend on
knowledge and skills that come from “the body”, 
i.e. sensorimotor and perceptual skills. (See Where 
Mathematics Comes From) 

[44] Knowledge representation: 

• ACM 1998, I.2.4,

• Russell & Norvig 2003, pp. 320-363, 

• Poole, Mackworth & Goebel 1998, pp. 23-46, 69- 
81, 169-196, 235-277, 281-298, 319-345, 

• Luger & Stubblefield 2004, pp. 227-243, 

• Nilsson 1998, chpt. 18 

[45] Knowledge engineering: 

• Russell & Norvig 2003, pp. 260-266, 

• Poole, Mackworth & Goebel 1998, pp. 199-233, 

• Nilsson 1998, chpt. 17.1-17.4

[46] Representing categories and relations: Semantic net¬ 
works, description logics, inheritance (including frames 
and scripts): 

• Russell & Norvig 2003, pp. 349-354, 

• Poole, Mackworth & Goebel 1998, pp. 174-177,

• Luger & Stubblefield 2004, pp. 248-258, 

• Nilsson 1998, chpt. 18.3 

[47] Representing events and time: Situation calculus, event
calculus, fluent calculus (including solving the frame
problem):

• Russell & Norvig 2003, pp. 328-341, 

• Poole, Mackworth & Goebel 1998, pp. 281-298, 

• Nilsson 1998, chpt. 18.2 

[48] Causal calculus: 

• Poole, Mackworth & Goebel 1998, pp. 335-337 

[49] Representing knowledge about knowledge: Belief calcu¬ 
lus, modal logics: 

• Russell & Norvig 2003, pp. 341-344, 

• Poole, Mackworth & Goebel 1998, pp. 275-277 

[50] Ontology: 

• Russell & Norvig 2003, pp. 320-328 

[51] Qualification problem: 

• McCarthy & Hayes 1969

• Russell & Norvig 2003 

While McCarthy was primarily concerned with issues in 
the logical representation of actions, Russell & Norvig 
2003 apply the term to the more general issue of default 
reasoning in the vast network of assumptions underlying 
all our commonsense knowledge. 

[52] Default reasoning and default logic, non-monotonic log¬ 
ics, circumscription, closed world assumption, abduction 

Chapter 3

Information theory 


Not to be confused with information science. 

Information theory is a branch of applied mathematics, 
electrical engineering, and computer science involving 
the quantification of information. Information theory was 
developed by Claude E. Shannon to find fundamental lim¬ 
its on signal processing operations such as compressing 
data and on reliably storing and communicating data. 
Since its inception it has broadened to find applications in
many other areas, including statistical inference, natural
language processing, cryptography, neurobiology,[1] the
evolution[2] and function[3] of molecular codes, model
selection in ecology,[4] thermal physics,[5] quantum
computing, linguistics, plagiarism detection,[6] pattern
recognition, anomaly detection and other forms of data
analysis.[7]

A key measure of information is entropy, which is usually 
expressed by the average number of bits needed to store 
or communicate one symbol in a message. Entropy quan¬ 
tifies the uncertainty involved in predicting the value of a 
random variable. For example, specifying the outcome of 
a fair coin flip (two equally likely outcomes) provides less 
information (lower entropy) than specifying the outcome 
from a roll of a die (six equally likely outcomes). 
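
The coin and die comparison can be checked numerically. The following is a minimal sketch, assuming base-2 logarithms so that the result is in bits; the helper name `entropy_bits` is hypothetical and chosen only for illustration.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution given as probabilities."""
    # Terms with p == 0 are skipped (the usual convention that 0 log 0 = 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]      # two equally likely outcomes
fair_die = [1 / 6] * 6      # six equally likely outcomes

coin_bits = entropy_bits(fair_coin)   # 1 bit
die_bits = entropy_bits(fair_die)     # log2(6) bits, a little over 2.58
print(coin_bits, die_bits)
```

For a uniform distribution over n outcomes the entropy is log2(n), so it grows with the number of equally likely outcomes.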

Applications of fundamental topics of information the¬ 
ory include lossless data compression (e.g. ZIP files), 
lossy data compression (e.g. MP3s and JPEGs), and 
channel coding (e.g. for Digital Subscriber Line (DSL)). 
The field is at the intersection of mathematics, statistics, 
computer science, physics, neurobiology, and electrical 
engineering. Its impact has been crucial to the success 
of the Voyager missions to deep space, the invention 
of the compact disc, the feasibility of mobile phones, 
the development of the Internet, the study of linguistics 
and of human perception, the understanding of black 
holes, and numerous other fields. Important sub-fields 
of information theory are source coding, channel coding, 
algorithmic complexity theory, algorithmic information 
theory, information-theoretic security, and measures of 
information. 


3.1 Overview 


Information theory studies the transmission, processing, 
utilization, and extraction of information. Abstractly, in¬ 
formation can be thought of as the resolution of uncer¬ 
tainty. In the case of communication of information over 
a noisy channel, this abstract concept was made concrete 
in 1948 by Claude Shannon in A Mathematical Theory 
of Communication, in which “information” is thought of 
as a set of possible messages, where the goal is to send 
these messages over a noisy channel, and then to have the 
receiver reconstruct the message with low probability of 
error, in spite of the channel noise. Shannon’s main result,
the noisy-channel coding theorem, showed that, in
the limit of many channel uses, the rate of information
that is asymptotically achievable is equal to the channel
capacity, a quantity dependent merely on the statistics of
the channel over which the messages are sent.

Information theory is closely associated with a collection 
of pure and applied disciplines that have been investi¬ 
gated and reduced to engineering practice under a va¬ 
riety of rubrics throughout the world over the past half 
century or more: adaptive systems, anticipatory systems, 
artificial intelligence, complex systems, complexity sci¬ 
ence, cybernetics, informatics, machine learning, along 
with systems sciences of many descriptions. Informa¬ 
tion theory is a broad and deep mathematical theory, with 
equally broad and deep applications, amongst which is the 
vital field of coding theory. 

Coding theory is concerned with finding explicit methods,
called codes, for increasing the efficiency and reducing
the error rate of data communication over noisy channels
to near the channel capacity. These codes can be
roughly subdivided into data compression (source coding) 
and error-correction (channel coding) techniques. In the 
latter case, it took many years to find the methods Shan¬ 
non’s work proved were possible. A third class of infor¬ 
mation theory codes are cryptographic algorithms (both 
codes and ciphers). Concepts, methods and results from 
coding theory and information theory are widely used in 
cryptography and cryptanalysis. See the article ban (unit) 
for a historical application. 

Information theory is also used in information retrieval,
intelligence gathering, gambling, statistics, and even in
musical composition.


3.2 Historical background 

Main article: History of information theory 

The landmark event that established the discipline of in¬ 
formation theory, and brought it to immediate worldwide 
attention, was the publication of Claude E. Shannon's 
classic paper "A Mathematical Theory of Communica¬ 
tion" in the Bell System Technical Journal in July and Oc¬ 
tober 1948. 

Prior to this paper, limited information-theoretic ideas
had been developed at Bell Labs, all implicitly assuming
events of equal probability. Harry Nyquist's 1924 paper,
Certain Factors Affecting Telegraph Speed, contains a theoretical
section quantifying “intelligence” and the “line
speed” at which it can be transmitted by a communication
system, giving the relation $W = K \log m$ (recalling
Boltzmann’s constant), where W is the speed of transmission
of intelligence, m is the number of different voltage
levels to choose from at each time step, and K is a constant.
Ralph Hartley's 1928 paper, Transmission of Information,
uses the word information as a measurable quantity,
reflecting the receiver’s ability to distinguish one sequence
of symbols from any other, thus quantifying information
as $H = \log S^n = n \log S$, where S was the number
of possible symbols, and n the number of symbols in
a transmission. The unit of information was therefore the
decimal digit, much later renamed the hartley in his honour
as a unit or scale or measure of information. Alan
Turing in 1940 used similar ideas as part of the statistical
analysis of the breaking of the German Second World War
Enigma ciphers.

Much of the mathematics behind information theory
with events of different probabilities was developed for
the field of thermodynamics by Ludwig Boltzmann and 
J. Willard Gibbs. Connections between information- 
theoretic entropy and thermodynamic entropy, includ¬ 
ing the important contributions by Rolf Landauer in the 
1960s, are explored in Entropy in thermodynamics and 
information theory. 

In Shannon’s revolutionary and groundbreaking paper, 
the work for which had been substantially completed at 
Bell Labs by the end of 1944, Shannon for the first time 
introduced the qualitative and quantitative model of com¬ 
munication as a statistical process underlying information 
theory, opening with the assertion that 

“The fundamental problem of communication 
is that of reproducing at one point, either ex¬ 
actly or approximately, a message selected at 
another point.” 


With it came the ideas of 

• the information entropy and redundancy of a source, 
and its relevance through the source coding theorem; 

• the mutual information, and the channel capacity of 
a noisy channel, including the promise of perfect 
loss-free communication given by the noisy-channel 
coding theorem; 

• the practical result of the Shannon-Hartley law for 
the channel capacity of a Gaussian channel; as well 
as 

• the bit: a new way of seeing the most fundamental
unit of information. 


3.3 Quantities of information 

Main article: Quantities of information 

Information theory is based on probability theory and 
statistics. Information theory often concerns itself with 
measures of information of the distributions associated 
with random variables. Important quantities of informa¬ 
tion are entropy, a measure of information in a single 
random variable, and mutual information, a measure of 
information in common between two random variables. 
The former quantity is a property of the probability dis¬ 
tribution of a random variable and gives a limit on the rate 
at which data generated by independent samples with the 
given distribution can be reliably compressed. The latter 
is a property of the joint distribution of two random
variables, and is the maximum rate of reliable communication
across a noisy channel in the limit of long block lengths, 
when the channel statistics are determined by the joint 
distribution. 

The choice of logarithmic base in the following formulae 
determines the unit of information entropy that is used. 
A common unit of information is the bit, based on the 
binary logarithm. Other units include the nat, which is 
based on the natural logarithm, and the hartley, which is 
based on the common logarithm. 

In what follows, an expression of the form $p \log p$ is
considered by convention to be equal to zero whenever
$p = 0$. This is justified because $\lim_{p \to 0^+} p \log p = 0$ for
any logarithmic base. 
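
The limit can be checked with L'Hôpital's rule. The short derivation below is a sketch that assumes the natural logarithm; any other base changes the expression only by a constant factor, so the limit is still zero.

```latex
\lim_{p \to 0^+} p \log p
  = \lim_{p \to 0^+} \frac{\log p}{1/p}
  = \lim_{p \to 0^+} \frac{1/p}{-1/p^{2}}
  = \lim_{p \to 0^+} (-p)
  = 0 .
```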

3.3.1 Entropy

The entropy, $H$, of a discrete random variable $X$ is
intuitively a measure of the amount of uncertainty associated
with the value of $X$ when only its distribution is
known. So, for example, if the distribution associated
with a random variable is a constant distribution (i.e.
equal to some known value with probability 1), then entropy
is minimal, and equal to 0. Furthermore, in the
case of a distribution restricted to take on a finite number
of values, entropy is maximized with a uniform distribution
over the values that the distribution takes on.

Suppose one transmits 1000 bits (0s and 1s). If the value
of each of these bits is known to the receiver (has a specific
value with certainty) ahead of transmission, it is clear that
no information is transmitted. If, however, each bit is
independently equally likely to be 0 or 1, 1000 shannons
of information (also often called bits, in the information
theoretic sense) have been transmitted. Between these
two extremes, information can be quantified as follows.
If $\mathbb{X}$ is the set of all messages $\{x_1, \ldots, x_n\}$ that $X$ could
be, and $p(x)$ is the probability of some $x \in \mathbb{X}$, then the
entropy, $H$, of $X$ is defined:[8]

$$H(X) = \mathbb{E}_X[I(x)] = -\sum_{x \in \mathbb{X}} p(x) \log p(x).$$

(Here, $I(x)$ is the self-information, which is the entropy
contribution of an individual message, and $\mathbb{E}_X$ is the
expected value.) A property of entropy is that it is maximized
when all the messages in the message space are
equiprobable, $p(x) = 1/n$ (i.e., most unpredictable),
in which case $H(X) = \log n$.

The special case of information entropy for a random
variable with two outcomes is the binary entropy function,
usually taken to the logarithmic base 2, thus having
the shannon (Sh) as unit:

$$H_b(p) = -p \log_2 p - (1 - p) \log_2 (1 - p).$$

Entropy of a Bernoulli trial as a function of success probability,
often called the binary entropy function, $H_b(p)$. The entropy
is maximized at 1 bit per trial when the two possible outcomes are
equally probable, as in an unbiased coin toss. (Horizontal axis:
Pr(X = 1), from 0 to 1.)

3.3.2 Joint entropy

The joint entropy of two discrete random variables $X$
and $Y$ is merely the entropy of their pairing: $(X, Y)$.
This implies that if $X$ and $Y$ are independent, then their
joint entropy is the sum of their individual entropies.

For example, if $(X, Y)$ represents the position of a chess
piece, with $X$ the row and $Y$ the column, then the joint
entropy of the row of the piece and the column of the
piece will be the entropy of the position of the piece.

$$H(X, Y) = \mathbb{E}_{X,Y}[-\log p(x, y)] = -\sum_{x, y} p(x, y) \log p(x, y)$$

Despite similar notation, joint entropy should not be confused
with cross entropy.


3.3.3 Conditional entropy (equivocation)

The conditional entropy or conditional uncertainty of
$X$ given random variable $Y$ (also called the equivocation
of $X$ about $Y$) is the average conditional entropy over $Y$:[9]

$$H(X|Y) = \mathbb{E}_Y[H(X|y)] = -\sum_{y \in \mathbb{Y}} p(y) \sum_{x \in \mathbb{X}} p(x|y) \log p(x|y) = -\sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(y)}.$$

Because entropy can be conditioned on a random variable
or on that random variable being a certain value, care
should be taken not to confuse these two definitions of
conditional entropy, the former of which is in more common
use. A basic property of this form of conditional
entropy is that:

$$H(X|Y) = H(X, Y) - H(Y).$$

3.3.4 Mutual information (transinformation)

Mutual information measures the amount of information
that can be obtained about one random variable by
observing another. It is important in communication
where it can be used to maximize the amount of information
shared between sent and received signals. The
mutual information of $X$ relative to $Y$ is given by:

$$I(X; Y) = \mathbb{E}_{X,Y}[SI(x, y)] = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$

where SI (Specific mutual information) is the pointwise
mutual information.

A basic property of the mutual information is that

$$I(X; Y) = H(X) - H(X|Y).$$

That is, knowing $Y$, we can save an average of $I(X; Y)$
bits in encoding $X$ compared to not knowing $Y$.

Mutual information is symmetric:

$$I(X; Y) = I(Y; X) = H(X) + H(Y) - H(X, Y).$$

Mutual information can be expressed as the average 
Kullback-Leibler divergence (information gain) between 
the posterior probability distribution of X given the value 
of Y and the prior distribution on X: 


$$I(X; Y) = \mathbb{E}_{p(y)}[D_{\mathrm{KL}}(p(X|Y = y) \,\|\, p(X))].$$

In other words, this is a measure of how much, on the av¬ 
erage, the probability distribution on X will change if we 
are given the value of Y. This is often recalculated as the 
divergence from the product of the marginal distributions 
to the actual joint distribution: 


$$I(X; Y) = D_{\mathrm{KL}}(p(X, Y) \,\|\, p(X)\, p(Y)).$$

Mutual information is closely related to the log-likelihood 
ratio test in the context of contingency tables and the 
multinomial distribution and to Pearson’s χ² test: mutual
information can be considered a statistic for assessing in¬ 
dependence between a pair of variables, and has a well- 
specified asymptotic distribution. 
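
The identities in this section can be verified on a small example. The sketch below uses a hypothetical joint distribution over two binary variables (the numbers are made up for illustration) and computes the mutual information in two ways: from the entropies, and as the Kullback-Leibler divergence from the product of the marginals to the joint distribution.

```python
import math

def H(probs):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y) with X, Y in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1,
         (1, 0): 0.1, (1, 1): 0.4}

# Marginal distributions p(x) and p(y).
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

h_x = H(px.values())
h_y = H(py.values())
h_xy = H(joint.values())

h_x_given_y = h_xy - h_y               # H(X|Y) = H(X,Y) - H(Y)
mi_from_entropies = h_x + h_y - h_xy   # I(X;Y) = H(X) + H(Y) - H(X,Y)

# I(X;Y) as the KL divergence from p(x)p(y) to p(x,y).
mi_from_kl = sum(p * math.log2(p / (px[x] * py[y]))
                 for (x, y), p in joint.items() if p > 0)

print(h_x, h_x_given_y, mi_from_entropies, mi_from_kl)
```

Both routes give the same value, and it also equals H(X) - H(X|Y), the average saving in bits from knowing Y.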

3.3.5 Kullback-Leibler divergence (information gain)

The Kullback-Leibler divergence (or information divergence,
information gain, or relative entropy) is a
way of comparing two distributions: a “true” probability
distribution p(X), and an arbitrary probability distribution
q(X). If we compress data in a manner that assumes q(X)
is the distribution underlying some data, when, in reality,
p(X) is the correct distribution, the Kullback-Leibler
divergence is the average number of additional bits per
datum necessary for compression. It is thus defined

$$D_{\mathrm{KL}}(p(X) \,\|\, q(X)) = \sum_{x \in \mathbb{X}} -p(x) \log q(x) \, - \, \sum_{x \in \mathbb{X}} -p(x) \log p(x) = \sum_{x \in \mathbb{X}} p(x) \log \frac{p(x)}{q(x)}.$$

Although it is sometimes used as a 'distance metric', KL
divergence is not a true metric since it is not symmetric
and does not satisfy the triangle inequality (making it a
semi-quasimetric).

3.3.6 Kullback-Leibler divergence of a
prior from the truth

Another interpretation of KL divergence is this: suppose
a number X is about to be drawn randomly from a discrete
set with probability distribution p(x). If Alice knows the
true distribution p(x), while Bob believes (has a prior) that
the distribution is q(x), then Bob will be more surprised
than Alice, on average, upon seeing the value of X. The
KL divergence is the (objective) expected value of Bob’s
(subjective) surprisal minus Alice’s surprisal, measured in
bits if the log is in base 2. In this way, the extent to which
Bob’s prior is “wrong” can be quantified in terms of how
“unnecessarily surprised” it’s expected to make him.

3.3.7 Other quantities

Other important information theoretic quantities include
Rényi entropy (a generalization of entropy), differential
entropy (a generalization of quantities of information to
continuous distributions), and the conditional mutual
information.


3.4 Coding theory


Main article: Coding theory

Coding theory is one of the most important and direct
applications of information theory. It can be subdivided
into source coding theory and channel coding theory. Using
a statistical description for data, information theory
quantifies the number of bits needed to describe the data,
which is the information entropy of the source.

A picture showing scratches on the readable surface of a CD-R.
Music and data CDs are coded using error correcting codes and
thus can still be read even if they have minor scratches using error
detection and correction.

• Data compression (source coding): There are two
formulations for the compression problem:

1. lossless data compression: the data must be reconstructed
exactly

2. lossy data compression: allocates bits needed to reconstruct
the data, within a specified fidelity level
measured by a distortion function. This subset of
information theory is called rate-distortion theory.

• Error-correcting codes (channel coding): While 
data compression removes as much redundancy as 
possible, an error correcting code adds just the right 
kind of redundancy (i.e., error correction) needed 
to transmit the data efficiently and faithfully across 
a noisy channel. 
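
As an illustration of adding redundancy for error correction, here is a minimal sketch of a three-fold repetition code with majority-vote decoding. This code is not taken from the text above and is far from efficient (its rate is only 1/3); it is chosen because it is the simplest example. The function names are hypothetical, and the channel is assumed to flip each transmitted bit independently with probability p.

```python
from itertools import product

def encode(bit, n=3):
    """Repeat the bit n times (n is assumed to be odd)."""
    return [bit] * n

def decode(received):
    """Majority vote over the received bits."""
    return 1 if sum(received) > len(received) // 2 else 0

def block_error_probability(p, n=3):
    """Exact probability that majority decoding fails when each bit is
    flipped independently with probability p."""
    total = 0.0
    for flips in product([0, 1], repeat=n):
        received = [b ^ f for b, f in zip(encode(0, n), flips)]
        if decode(received) != 0:
            k = sum(flips)  # number of flipped bits in this pattern
            total += (p ** k) * ((1 - p) ** (n - k))
    return total

p = 0.1
print(block_error_probability(p))  # equals 3*p**2*(1-p) + p**3
```

With p = 0.1 the decoded bit is wrong far less often than a single uncoded bit would be, at the price of sending three bits for every data bit.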

This division of coding theory into compression and 
transmission is justified by the information transmission 
theorems, or source-channel separation theorems that 
justify the use of bits as the universal currency for infor¬ 
mation in many contexts. However, these theorems only 
hold in the situation where one transmitting user wishes 
to communicate to one receiving user. In scenarios with 
more than one transmitter (the multiple-access channel), 
more than one receiver (the broadcast channel) or inter¬ 
mediary “helpers” (the relay channel), or more general 
networks, compression followed by transmission may no 
longer be optimal. Network information theory refers to 
these multi-agent communication models. 

3.4.1 Source theory 

Any process that generates successive messages can be 
considered a source of information. A memoryless 
source is one in which each message is an independent 
identically distributed random variable, whereas the 
properties of ergodicity and stationarity impose less re¬ 
strictive constraints. All such sources are stochastic. 
These terms are well studied in their own right outside 
information theory. 

Rate 

Information rate is the average entropy per symbol. For
memoryless sources, this is merely the entropy of each
symbol, while, in the case of a stationary stochastic process,
it is

$$r = \lim_{n \to \infty} H(X_n | X_{n-1}, X_{n-2}, X_{n-3}, \ldots),$$

that is, the conditional entropy of a symbol given all the
previous symbols generated. For the more general case
of a process that is not necessarily stationary, the average
rate is

$$r = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n);$$

that is, the limit of the joint entropy per symbol. For
stationary sources, these two expressions give the same
result.[10]


It is common in information theory to speak of the “rate” 
or “entropy” of a language. This is appropriate, for exam¬ 
ple, when the source of information is English prose. The 
rate of a source of information is related to its redundancy 
and how well it can be compressed, the subject of source 
coding. 
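
The agreement of the two rate expressions for a stationary source can be seen numerically. The sketch below assumes a hypothetical two-state Markov source (the transition matrix `P` and its stationary distribution `pi` are made-up values). For a first-order Markov chain, conditioning on all previous symbols is the same as conditioning on the last one, so the first expression reduces to H(X_n | X_{n-1}).

```python
import math
from itertools import product

# Hypothetical two-state stationary Markov source.
P = [[0.9, 0.1],
     [0.4, 0.6]]   # P[i][j] = Pr(next symbol = j | current symbol = i)
pi = [0.8, 0.2]    # stationary distribution, satisfying pi = pi P

def H(probs):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Conditional-entropy form of the rate: H(X_n | X_{n-1}).
rate_conditional = sum(pi[i] * H(P[i]) for i in range(2))

def joint_entropy_per_symbol(n):
    """(1/n) H(X_1, ..., X_n), computed by enumerating all 2**n sequences."""
    probs = []
    for seq in product(range(2), repeat=n):
        p = pi[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= P[a][b]
        probs.append(p)
    return H(probs) / n

for n in (1, 4, 12):
    print(n, joint_entropy_per_symbol(n), rate_conditional)
```

The joint entropy per symbol decreases toward the conditional-entropy rate as n grows, which is the convergence the two formulas describe.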

3.4.2 Channel capacity 

Main article: Channel capacity 

Communication over a channel, such as an Ethernet
cable, is the primary motivation of information theory.
As anyone who’s ever used a telephone (mobile or land¬ 
line) knows, however, such channels often fail to produce 
exact reconstruction of a signal; noise, periods of silence, 
and other forms of signal corruption often degrade qual¬ 
ity. How much information can one hope to communicate 
over a noisy (or otherwise imperfect) channel? 

Consider the communications process over a discrete 
channel. A simple model of the process is shown below: 



Here X represents the space of messages transmitted,
and Y the space of messages received during a unit time
over our channel. Let $p(y|x)$ be the conditional probability
distribution function of Y given X. We will consider
$p(y|x)$ to be an inherent fixed property of our communications
channel (representing the nature of the noise of
our channel). Then the joint distribution of X and Y is
completely determined by our channel and by our choice
of $f(x)$, the marginal distribution of messages we choose
to send over the channel. Under these constraints, we
would like to maximize the rate of information, or the
signal, we can communicate over the channel. The appropriate
measure for this is the mutual information, and
this maximum mutual information is called the channel
capacity and is given by:

$$C = \max_{f} I(X; Y).$$

This capacity has the following property related to communicating
at information rate R (where R is usually bits
per symbol). For any information rate R < C and coding
error ε > 0, for large enough N, there exists a code of
length N and rate ≥ R and a decoding algorithm, such that
the maximal probability of block error is ≤ ε; that is, it
is always possible to transmit with arbitrarily small block
error. In addition, for any rate R > C, it is impossible to
transmit with arbitrarily small block error.

Channel coding is concerned with finding such nearly 
optimal codes that can be used to transmit data over a 
noisy channel with a small coding error at a rate near the
channel capacity. 

Capacity of particular channel models 

• A continuous-time analog communications channel 
subject to Gaussian noise — see Shannon-Hartley 
theorem. 

• A binary symmetric channel (BSC) with crossover
probability p is a binary input, binary output channel
that flips the input bit with probability p. The BSC
has a capacity of $1 - H_b(p)$ bits per channel use,
where $H_b$ is the binary entropy function to the base
2 logarithm.



• A binary erasure channel (BEC) with erasure prob¬ 
ability p is a binary input, ternary output channel. 
The possible channel outputs are 0, 1, and a third
symbol 'e' called an erasure. The erasure represents 
complete loss of information about an input bit. The 
capacity of the BEC is 1 - p bits per channel use. 
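
The closed-form capacities of the BSC and BEC can be compared with a direct maximization of the mutual information over the input distribution. The following is a rough sketch, assuming a simple grid search over Pr(X = 0); the function names and the step count are illustrative choices, not a standard method for computing capacity.

```python
import math

def mutual_information(input_dist, channel):
    """I(X;Y) in bits for input distribution f(x) and channel[x][y] = p(y|x)."""
    n_out = len(channel[0])
    p_y = [sum(input_dist[x] * channel[x][y] for x in range(len(input_dist)))
           for y in range(n_out)]
    total = 0.0
    for x, fx in enumerate(input_dist):
        for y in range(n_out):
            joint = fx * channel[x][y]
            if joint > 0:
                total += joint * math.log2(joint / (fx * p_y[y]))
    return total

def capacity_binary_input(channel, steps=1000):
    """Approximate C = max over f of I(X;Y) by a grid search over f(0) = q."""
    return max(mutual_information([q, 1 - q], channel)
               for q in (i / steps for i in range(steps + 1)))

def binary_entropy(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0, 1) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = 0.1
bsc = [[1 - p, p], [p, 1 - p]]             # outputs: 0, 1
bec = [[1 - p, 0.0, p], [0.0, 1 - p, p]]   # outputs: 0, 1, erasure

print(capacity_binary_input(bsc), 1 - binary_entropy(p))
print(capacity_binary_input(bec), 1 - p)
```

For both channels the maximum is reached at the uniform input distribution, which the grid includes, so the search reproduces the closed-form values up to floating-point error.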




3.5 Applications to other fields 

3.5.1 Intelligence uses and secrecy applications

Information theoretic concepts apply to cryptography and
cryptanalysis. Turing's information unit, the ban, was
used in the Ultra project, breaking the German Enigma
machine code and hastening the end of World War II in
Europe. Shannon himself defined an important concept
now called the unicity distance. Based on the redundancy
of the plaintext, it attempts to give a minimum amount of
ciphertext necessary to ensure unique decipherability.

Information theory leads us to believe it is much more
difficult to keep secrets than it might first appear. A
brute force attack can break systems based on asymmetric
key algorithms or on most commonly used methods of
symmetric key algorithms (sometimes called secret key
algorithms), such as block ciphers. The security of all
such methods currently comes from the assumption that
no known attack can break them in a practical amount of
time.

Information theoretic security refers to methods such as
the one-time pad that are not vulnerable to such brute
force attacks. In such cases, the positive conditional
mutual information between the plaintext and ciphertext
(conditioned on the key) can ensure proper transmission,
while the unconditional mutual information between
the plaintext and ciphertext remains zero, resulting in
absolutely secure communications. In other words, an
eavesdropper would not be able to improve his or her
guess of the plaintext by gaining knowledge of the ciphertext
but not of the key. However, as in any other
cryptographic system, care must be used to correctly apply
even information-theoretically secure methods; the
Venona project was able to crack the one-time pads of
the Soviet Union due to their improper reuse of key material.

3.5.2 Pseudorandom number generation

Pseudorandom number generators are widely available
in computer language libraries and application programs.
They are, almost universally, unsuited to cryptographic
use as they do not evade the deterministic nature
of modern computer equipment and software. A
class of improved random number generators is termed
cryptographically secure pseudorandom number generators,
but even they require random seeds external to
the software to work as intended. These can be obtained
via extractors, if done carefully. The measure
of sufficient randomness in extractors is min-entropy, a
value related to Shannon entropy through Rényi entropy;
Rényi entropy is also used in evaluating randomness in
cryptographic systems. Although related, the distinctions
among these measures mean that a random variable with
high Shannon entropy is not necessarily satisfactory for
use in an extractor and so for cryptography uses.

3.5.3 Seismic exploration

One early commercial application of information theory
was in the field of seismic oil exploration. Work in this
field made it possible to strip off and separate the unwanted
noise from the desired seismic signal. Information
theory and digital signal processing offer a major improvement
of resolution and image clarity over previous
analog methods.[11]

3.5.4 Semiotics 

Concepts from information theory such as redundancy 
and code control have been used by semioticians such as 
Umberto Eco and Rossi-Landi to explain ideology as a 
form of message transmission whereby a dominant social 
class emits its message by using signs that exhibit a high 
degree of redundancy such that only one message is decoded
among a selection of competing ones.[12]


3.5.5 Miscellaneous applications 

Information theory also has applications in gambling and 
investing, black holes, bioinformatics, and music. 


3.6 See also 

• Algorithmic probability 

• Algorithmic information theory 

• Bayesian inference 

• Communication theory 

• Constructor theory - a generalization of information 
theory that includes quantum information 

• Inductive probability 

• Minimum message length 

• Minimum description length 

• List of important publications 

• Philosophy of information 


3.6.2 History 

• Hartley, R.V.L. 

• History of information theory 

• Shannon, C.E. 

• Timeline of information theory 

• Yockey, H.P. 

3.6.3 Theory 

• Coding theory 

• Detection theory 

• Estimation theory 

• Fisher information 

• Information algebra 

• Information asymmetry 

• Information field theory 

• Information geometry 

• Information theory and measure theory 

• Kolmogorov complexity 

• Logic of information 

• Network coding 

• Philosophy of Information 

• Quantum information science 

• Semiotic information theory 

• Source coding 

• Unsolved Problems 


3.6.1 Applications 

• Active networking 

• Cryptanalysis 

• Cryptography 

• Cybernetics 

• Entropy in thermodynamics and information theory 

• Gambling 

• Intelligence (information gathering) 

• Seismic exploration 


3.6.4 Concepts 

• Ban (unit) 

• Channel capacity 

• Channel (communications) 

• Communication source 

• Conditional entropy 

• Covert channel 

• Decoder 

• Differential entropy 

• Encoder

• Information entropy

• Joint entropy 

• Kullback-Leibler divergence 

• Mutual information 

• Pointwise mutual information (PMI) 

• Receiver (information theory) 

• Redundancy 

• Renyi entropy 

• Self-information 

• Unicity distance 

• Variety


Chapter 4

Computational science 


Not to be confused with computer science. 

Computational science (also scientific computing or 
scientific computation) is concerned with construct¬ 
ing mathematical models and quantitative analysis tech¬ 
niques and using computers to analyze and solve scientific 
problems.[1] In practical use, it is typically the application
of computer simulation and other forms of computation 
from numerical analysis and theoretical computer science 
to problems in various scientific disciplines. 

The field is different from theory and laboratory
experiment, which are the traditional forms of science and
engineering. The scientific computing approach is to gain
understanding, mainly through the analysis of mathemat¬ 
ical models implemented on computers. 

Scientists and engineers develop computer programs
(application software) that model the systems being studied
and run these programs with various sets of input pa¬ 
rameters. In some cases, these models require massive 
amounts of calculations (usually floating-point) and are 
often executed on supercomputers or distributed comput¬ 
ing platforms. 

Numerical analysis is an important underpinning for tech¬ 
niques used in computational science. 

4.1 Applications of computational 
science 

Problem domains for computational science/scientific 
computing include: 

4.1.1 Numerical simulations 

Numerical simulations have different objectives depend¬ 
ing on the nature of the task being simulated: 

• Reconstruct and understand known events (e.g., 
earthquake, tsunamis and other natural disasters). 

• Predict future or unobserved situations (e.g.,
weather, sub-atomic particle behaviour, and
primordial explosions).

4.1.2 Model fitting and data analysis 

• Appropriately tune models or solve equations to re¬ 
flect observations, subject to model constraints (e.g. 
oil exploration geophysics, computational linguis¬ 
tics). 

• Use graph theory to model networks, such as those 
connecting individuals, organizations, websites, and 
biological systems. 

4.1.3 Computational optimization 

Main article: Mathematical optimization 


• Optimize known scenarios (e.g., technical and man¬ 
ufacturing processes, front-end engineering). 

• Machine learning 

4.2 Methods and algorithms 

Algorithms and mathematical methods used in compu¬ 
tational science are varied. Commonly applied methods 
include: 

• Numerical analysis 

• Application of Taylor series as convergent and 
asymptotic series 

• Computing derivatives by Automatic differentiation 
(AD) 

• Computing derivatives by finite differences 

• Graph theoretic suites 

• High order difference approximations via Taylor series
and Richardson extrapolation

• Methods of integration on a uniform mesh:
rectangle rule (also called midpoint rule), trapezoid
rule, Simpson’s rule

• Runge-Kutta methods for solving ordinary differential
equations

• Monte Carlo methods 

• Molecular dynamics 

• Linear programming 

• Branch and cut 

• Branch and Bound 

• Numerical linear algebra 

• Computing the LU factors by Gaussian elimination 

• Cholesky factorizations 

• Discrete Fourier transform and applications. 

• Newton’s method 

• Time stepping methods for dynamical systems 
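
A few of the methods listed above can be sketched in a handful of lines. The Python code below is a minimal illustration, not a production implementation: it shows the composite midpoint, trapezoid and Simpson's rules on a uniform mesh, plus a central finite-difference derivative. The function names and the test integrand (the sine function on the interval from 0 to pi) are assumptions chosen for the example.

```python
import math


def midpoint_rule(f, a, b, n):
    """Composite midpoint (rectangle) rule on n equal subintervals."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def trapezoid_rule(f, a, b, n):
    """Composite trapezoid rule on n equal subintervals."""
    h = (b - a) / n
    interior = sum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * f(a) + interior + 0.5 * f(b))


def simpson_rule(f, a, b, n):
    """Composite Simpson's rule; n must be even."""
    if n % 2:
        raise ValueError("n must be even for Simpson's rule")
    h = (b - a) / n
    odd = sum(f(a + i * h) for i in range(1, n, 2))
    even = sum(f(a + i * h) for i in range(2, n, 2))
    return h / 3 * (f(a) + 4 * odd + 2 * even + f(b))


def central_difference(f, x, h=1e-5):
    """Second-order central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


if __name__ == "__main__":
    exact = 2.0  # exact integral of sin(x) over [0, pi]
    for rule in (midpoint_rule, trapezoid_rule, simpson_rule):
        approx = rule(math.sin, 0.0, math.pi, 16)
        print(f"{rule.__name__}: error = {abs(approx - exact):.2e}")
    print(central_difference(math.sin, 0.0))  # close to cos(0) = 1
```

On a smooth integrand such as this one, Simpson's rule is typically far more accurate than the other two for the same number of mesh points, which is why higher-order rules are often preferred when the integrand is well behaved.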

Programming languages and computer algebra systems
commonly used for the more mathematical aspects of
scientific computing applications include R, TK Solver,
MATLAB, Mathematica,[2] SciLab, GNU Octave, Python
with SciPy, and PDL. The more computationally intensive
aspects of scientific computing will often use some
variation of C or Fortran and optimized algebra libraries
such as BLAS or LAPACK.

Computational science application programs often model 
real-world changing conditions, such as weather, air flow 
around a plane, automobile body distortions in a crash, 
the motion of stars in a galaxy, an explosive device, etc. 
Such programs might create a 'logical mesh' in computer 
memory where each item corresponds to an area in space 
and contains information about that space relevant to the 
model. For example, in weather models each item might
be a square kilometer, with land elevation, current wind
direction, humidity, temperature, pressure, etc. The
program would calculate the likely next state based on the
current state, in simulated time steps, solving equations
that describe how the system operates, and then repeat
the process to calculate the next state.
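
The mesh-and-time-step loop described above can be illustrated with a deliberately small stand-in. The Python sketch below is not a weather model: it assumes a one-dimensional mesh of temperatures and a simple explicit heat-diffusion update, with the two end cells held fixed. The function names, the mesh values and the diffusion number `alpha` are all hypothetical choices for the example.

```python
def step(temps, alpha):
    """Compute the next state of the mesh from the current state.

    temps: list of cell temperatures; the first and last cells are
           boundary cells and are left unchanged.
    alpha: dimensionless diffusion number; this explicit scheme is
           only stable for alpha <= 0.5.
    """
    new = temps[:]
    for i in range(1, len(temps) - 1):
        new[i] = temps[i] + alpha * (temps[i - 1] - 2 * temps[i] + temps[i + 1])
    return new


def simulate(temps, alpha, n_steps):
    """Repeat the single-step update n_steps times."""
    for _ in range(n_steps):
        temps = step(temps, alpha)
    return temps


# Hypothetical initial state: a hot left boundary, everything else cold.
mesh = [100.0] + [0.0] * 10
final = simulate(mesh, alpha=0.25, n_steps=2000)
print([round(t, 1) for t in final])
```

Each cell's next value depends only on the current values of the cell and its neighbours, which is the same pattern, in miniature, as computing the next state of a large mesh from its current state.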

The term computational scientist is used to describe 
someone skilled in scientific computing. This person is 
usually a scientist, an engineer or an applied mathemati¬ 
cian who applies high-performance computing in differ¬ 
ent ways to advance the state-of-the-art in their respective 
applied disciplines in physics, chemistry or engineering. 
Scientific computing has increasingly also impacted on 
other areas including economics, biology and medicine. 


Computational science is now commonly considered a
third mode of science, complementing and adding to
experimentation/observation and theory.[3] The essence
of computational science is numerical algorithms[4] and/or
computational mathematics. In fact, substantial effort in
computational sciences has been devoted to the
development of algorithms, their efficient implementation in
programming languages, and validation of computational
results. A collection of problems and solutions in
computational science can be found in Steeb, Hardy, Hardy and
Stoop, 2004.[5]

4.3 Reproducibility and open re¬ 
search computing 

The complexity of computational methods is a threat to
the reproducibility of research. Jon Claerbout has
become prominent for pointing out that reproducible
research requires archiving and documenting all raw data
and all code used to obtain a result.[6][7][8] Nick Barnes, in
the Science Code Manifesto, proposed five principles that
should be followed when software is used in open science
publication.[9] Tomi Kauppinen et al. established and
defined Linked Open Science, an approach to interconnect
scientific assets to enable transparent, reproducible and
transdisciplinary research.[10]


4.4 Journals 

Most scientific journals do not accept software papers
because a description of reasonably mature software
usually does not meet the criterion of novelty. Outside
computer science itself, there are only a few journals
dedicated to scientific software. Established journals like
Elsevier's Computer Physics Communications publish papers
that are not open-access (though the described software
usually is). To fill this gap, a new journal entitled Open
Research Computation was announced in 2010;[11] it closed
in 2012 without having published a single paper, for
lack of submissions, probably due to excessive quality
requirements.[12] A new initiative was launched in 2012,
the Journal of Open Research Software.[13]


4.5 Education 

Scientific computation is most often studied through an
applied mathematics or computer science program, or
within a standard mathematics, sciences, or engineering
program. At some institutions a specialization in
scientific computation can be earned as a “minor” within
another program (which may be at varying levels). However,
there are increasingly many bachelor’s and master’s
programs in computational science. Some schools also
offer the Ph.D. in computational science, computational
engineering, computational science and engineering, or
scientific computation.

There are also programs in areas such as computational 
physics, computational chemistry, etc. 

4.6 Related fields 

• Bioinformatics 

• Cheminformatics 

• Chemometrics 

• Computational archaeology 

• Computational biology 

• Computational chemistry 

• Computational economics 

• Computational electromagnetics 

• Computational engineering 

• Computational finance 

• Computational fluid dynamics 

• Computational forensics 

• Computational geophysics 

• Computational informatics 

• Computational intelligence 

• Computational law 

• Computational linguistics 

• Computational mathematics 

• Computational mechanics 

• Computational neuroscience 

• Computational particle physics 

• Computational physics 

• Computational sociology 

• Computational statistics 

• Computer algebra 

• Environmental simulation 

• Financial modeling 

• Geographic information system (GIS) 

• High performance computing 

• Machine learning 


• Network analysis 

• Neuroinformatics 

• Numerical linear algebra 

• Numerical weather prediction 

• Pattern recognition 

• Scientific visualization 


4.7 See also 

• Computational science and engineering 

• Comparison of computer algebra systems 

• List of molecular modeling software 

• List of numerical analysis software 

• List of statistical packages 

• Timeline of scientific computing 

• Simulated reality


Chapter 5

Exploratory data analysis 


In statistics, exploratory data analysis (EDA) is an ap¬ 
proach to analyzing data sets to summarize their main 
characteristics, often with visual methods. A statistical 
model can be used or not, but primarily EDA is for see¬ 
ing what the data can tell us beyond the formal model¬ 
ing or hypothesis testing task. Exploratory data analysis 
was promoted by John Tukey to encourage statisticians to 
explore the data, and possibly formulate hypotheses that 
could lead to new data collection and experiments. EDA 
is different from initial data analysis (IDA),[1] which
focuses more narrowly on checking assumptions required
for model fitting and hypothesis testing, and handling 
missing values and making transformations of variables 
as needed. EDA encompasses IDA. 


5.1 Overview

Tukey defined data analysis in 1961 as: “Procedures for
analyzing data, techniques for interpreting the results of
such procedures, ways of planning the gathering of data
to make its analysis easier, more precise or more
accurate, and all the machinery and results of (mathematical)
statistics which apply to analyzing data.”[2]

Tukey’s championing of EDA encouraged the
development of statistical computing packages, especially S
at Bell Labs. The S programming language inspired
the systems S-PLUS and R. This family of
statistical-computing environments featured vastly improved
dynamic visualization capabilities, which allowed
statisticians to identify outliers, trends and patterns in data that
merited further study.

Tukey’s EDA was related to two other developments in
statistical theory: robust statistics and nonparametric
statistics, both of which tried to reduce the sensitivity
of statistical inferences to errors in formulating statistical
models. Tukey promoted the use of the five-number
summary of numerical data (the two extremes, that is, the
maximum and minimum, together with the median and the
quartiles) because the median and quartiles, being
functions of the empirical distribution, are defined for all
distributions, unlike the mean and standard deviation;
moreover, the quartiles and median are more robust to
skewed or heavy-tailed distributions than traditional
summaries (the mean and standard deviation). The packages
S, S-PLUS, and R included routines using resampling
statistics, such as Quenouille and Tukey’s jackknife and
Efron’s bootstrap, which are nonparametric and robust
(for many problems).

Exploratory data analysis, robust statistics,
nonparametric statistics, and the development of statistical
programming languages facilitated statisticians’ work on
scientific and engineering problems. Such problems included
the fabrication of semiconductors and the understanding
of communications networks, which concerned Bell
Labs. These statistical developments, all championed
by Tukey, were designed to complement the analytic
theory of testing statistical hypotheses, particularly the
Laplacian tradition’s emphasis on exponential families.


5.2 EDA development

John W. Tukey wrote the book “Exploratory Data
Analysis” in 1977.[4] Tukey held that too much emphasis in
statistics was placed on statistical hypothesis testing
(confirmatory data analysis); more emphasis needed to be
placed on using data to suggest hypotheses to test. In
particular, he held that confusing the two types of analyses
and employing them on the same set of data can lead to
systematic bias owing to the issues inherent in testing
hypotheses suggested by the data.

The objectives of EDA are to:

• Suggest hypotheses about the causes of observed 
phenomena 

• Assess assumptions on which statistical inference 
will be based 

• Support the selection of appropriate statistical tools 
and techniques 

• Provide a basis for further data collection through 
surveys or experiments[5]

Many EDA techniques have been adopted into data min¬ 
ing and are being taught to young students as a way to 
introduce them to statistical thinking.[6]
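
The five-number summary that Tukey promoted, described earlier in this chapter, can be computed with the Python standard library. The sketch below is illustrative only: quartile conventions differ between textbooks and packages (Tukey's own "hinges" are one of several definitions), and the choice of the "inclusive" method here is an assumption, as are the function names and the small data sets.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (min(data), q1, q2, q3, max(data))


def trimean(data):
    """Tukey's trimean: a weighted average of the quartiles and the median."""
    _, q1, q2, q3, _ = five_number_summary(data)
    return (q1 + 2 * q2 + q3) / 4


# Hypothetical data: the second list replaces one value with an extreme one.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 900]

print(five_number_summary(clean))
print(five_number_summary(with_outlier))
print(statistics.mean(clean), statistics.mean(with_outlier))
```

In this example the single extreme value changes the mean substantially while leaving the median and quartiles where they were, which is the robustness property the text attributes to these summaries.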

5.3 Techniques 

There are a number of tools that are useful for EDA, but 
EDA is characterized more by the attitude taken than by 
particular techniques.[7]

Typical graphical techniques used in EDA are: 

• Box plot 

• Histogram 

• Multi-vari chart 

• Run chart 

• Pareto chart 

• Scatter plot 

• Stem-and-leaf plot 

• Parallel coordinates 

• Odds ratio 

• Multidimensional scaling 

• Targeted projection pursuit 

• Principal component analysis 

• Multilinear PCA 

• Projection methods such as grand tour, guided tour 
and manual tour 

• Interactive versions of these plots

Typical quantitative techniques are:

• Median polish

• Trimean

• Ordination


5.4 History

Many EDA ideas can be traced back to earlier authors,
for example:

• Francis Galton emphasized order statistics and
quantiles.

• Arthur Lyon Bowley used precursors of the stemplot
and five-number summary (Bowley actually used
a "seven-figure summary", including the extremes,
deciles and quartiles, along with the median; see his
Elementary Manual of Statistics (3rd edn., 1920), p.
62, where he defines “the maximum and minimum,
median, quartiles and two deciles” as the “seven
positions”).

• Andrew Ehrenberg articulated a philosophy of data
reduction (see his book of the same name).

The Open University course Statistics in Society (MDST
242) took the above ideas and merged them with
Gottfried Noether's work, which introduced statistical
inference via coin-tossing and the median test.

5.5 Example

Findings from EDA are often orthogonal to the primary
analysis task. This is an example, described in more detail
in [8]. The analysis task is to find the variables which best
predict the tip that a dining party will give to the waiter.
The variables available are tip, total bill, gender, smoking
status, time of day, day of the week and size of the party.
The analysis task requires that a regression model be fit
with either tip or tip rate as the response variable. The
fitted model is

tip rate = 0.18 - 0.01 × size

which says that as the size of the dining party increases
by one person, the tip rate decreases by one percentage
point. Making plots of the data reveals other interesting
features not described by this model.

• Histogram of tips given by customers with bins equal
to $1 increments. The distribution of values is skewed
right and unimodal, which says that there are few
high tips, but lots of low tips.

• Histogram of tips given by customers with bins equal
to 10c increments. An interesting phenomenon is
visible: peaks in the counts at the full and half-dollar
amounts. This corresponds to customers rounding
tips. This is a behaviour that is common to other
types of purchases too, like gasoline.

• Scatterplot of tips vs bill. We would expect to see a
tight positive linear association, but instead see a lot
more variation. In particular, there are more points
in the lower right than upper left. Points in the lower
right correspond to tips that are lower than expected,
and it is clear that more customers are cheap rather
than generous.

• Scatterplot of tips vs bill separately by gender and 
smoking party. Smoking parties have a lot more 
variability in the tips that they give. Males tend to 
pay the (few) higher bills, and female non-smokers 
tend to be very consistent tippers (with the exception 
of three women). 

What is learned from the graphics is different from what
could be learned by the modeling. These pictures help
the data tell a story: we have discovered some features
of tipping that we perhaps did not anticipate in advance.
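
The two kinds of analysis in this example, fitting a line and binning values for a histogram at two resolutions, can be sketched as follows. This is a minimal illustration under stated assumptions: the data are hypothetical values constructed for the example (they are not the restaurant data set discussed above), amounts are held in integer cents to avoid rounding problems, and the function names are invented here.

```python
from collections import Counter


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def bin_counts(values_in_cents, bin_width_cents):
    """Count values per bin; each key is the lower edge of a bin, in cents."""
    counts = Counter((v // bin_width_cents) * bin_width_cents
                     for v in values_in_cents)
    return dict(sorted(counts.items()))


# Hypothetical party sizes and tip rates lying exactly on the quoted line.
sizes = [1, 2, 3, 4, 5, 6]
rates = [0.18 - 0.01 * s for s in sizes]
intercept, slope = fit_line(sizes, rates)
print(round(intercept, 2), round(slope, 2))

# Hypothetical tips in cents, viewed with $1 bins and with 10-cent bins.
tips = [100, 150, 200, 200, 250, 300, 318, 400, 500, 500]
print(bin_counts(tips, 100))
print(bin_counts(tips, 10))
```

The coarse bins show only the overall shape, while the fine bins separate round amounts from odd ones, which is the kind of detail the 10c histogram in the example brings out.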

5.6 Software 

• GGobi, free software for interactive data
visualization.

• CMU-DAP (Carnegie-Mellon University Data 
Analysis Package, FORTRAN source for EDA 
tools with English-style command syntax, 1977). 

• Data Applied, a comprehensive web-based data vi¬ 
sualization and data mining environment. 

• High-D for multivariate analysis using parallel coor¬ 
dinates. 

• JMP, an EDA package from SAS Institute. 

• KNIME (Konstanz Information Miner), an open-source
data exploration platform based on Eclipse.

• Orange, an open-source data mining software suite. 

• SOCR provides a large number of free Internet- 
accessible. 

• TinkerPlots (for upper elementary and middle 
school students). 

• Weka, an open-source data mining package that
includes visualisation and EDA tools such as targeted
projection pursuit.

5.7 See also 

• Anscombe’s quartet, on importance of exploration 

• Predictive analytics 

• Structured data analysis (statistics) 

• Configural frequency analysis


Chapter 6

Predictive analytics


Predictive analytics encompasses a variety of statisti¬ 
cal techniques from modeling, machine learning, and 
data mining that analyze current and historical facts to 
make predictions about future, or otherwise unknown, 
events.[1][2]

In business, predictive models exploit patterns found in 
historical and transactional data to identify risks and op¬ 
portunities. Models capture relationships among many 
factors to allow assessment of risk or potential associated 
with a particular set of conditions, guiding decision
making for candidate transactions.[3]

The defining functional effect of these technical ap¬ 
proaches is that predictive analytics provides a predictive 
score (probability) for each individual (customer, em¬ 
ployee, healthcare patient, product SKU, vehicle, com¬ 
ponent, machine, or other organizational unit) in order 
to determine, inform, or influence organizational pro¬ 
cesses that pertain across large numbers of individuals, 
such as in marketing, credit risk assessment, fraud detec¬ 
tion, manufacturing, healthcare, and government opera¬ 
tions including law enforcement. 

Predictive analytics is used in actuarial science,[4]
marketing,[5] financial services,[6] insurance,
telecommunications,[7] retail,[8] travel,[9] healthcare,[10]
pharmaceuticals[11] and other fields.

One of the best-known applications is credit
scoring,[1] which is used throughout financial services.
Scoring models process a customer’s credit history, loan
application, customer data, etc., in order to rank-order
individuals by their likelihood of making future credit
payments on time.


6.1 Definition 

Predictive analytics is an area of data mining that deals
with extracting information from data and using it to
predict trends and behavior patterns. Often the unknown
event of interest is in the future, but predictive
analytics can be applied to any type of unknown, whether it be
in the past, present or future. Examples include identifying
suspects after a crime has been committed, or detecting
credit card fraud as it occurs.[12] The core of predictive
analytics relies on capturing relationships between
explanatory variables and the predicted variables from past
occurrences, and exploiting them to predict the unknown
outcome. However, the accuracy and usability of results
depend greatly on the level of data analysis and the
quality of assumptions.

Predictive analytics is often defined as predicting at a 
more detailed level of granularity, i.e., generating pre¬ 
dictive scores (probabilities) for each individual organiza¬ 
tional element. This distinguishes it from forecasting. For 
example, “Predictive analytics—Technology that learns 
from experience (data) to predict the future behavior of 
individuals in order to drive better decisions.”[13]


6.2 Types 

Generally, the term predictive analytics is used to mean 
predictive modeling, “scoring” data with predictive mod¬ 
els, and forecasting. However, people are increasingly 
using the term to refer to related analytical disciplines, 
such as descriptive modeling and decision modeling or 
optimization. These disciplines also involve rigorous data 
analysis, and are widely used in business for segmentation 
and decision making, but have different purposes and the 
statistical techniques underlying them vary. 

6.2.1 Predictive models 

Predictive models are models of the relation between the
specific performance of a unit in a sample and one or
more known attributes or features of the unit. The
objective of the model is to assess the likelihood that a
similar unit in a different sample will exhibit the
specific performance. This category encompasses models in
many areas, such as marketing, where they seek out subtle
data patterns to answer questions about customer
performance, or fraud detection models. Predictive models
often perform calculations during live transactions, for
example, to evaluate the risk or opportunity of a given
customer or transaction, in order to guide a decision. With
advancements in computing speed, individual agent
modeling systems have become capable of simulating human
behaviour or reactions to given stimuli or scenarios.

The available sample units with known attributes and
known performances are referred to as the “training
sample.” The units in other samples, with known attributes
but unknown performances, are referred to as “out of
[training] sample” units. The out-of-sample units need not
bear any chronological relation to the training sample
units. For example, the training sample may consist of
literary attributes of writings by Victorian authors, with
known attribution, and the out-of-sample unit may be newly
found writing with unknown authorship; a predictive model
may aid in attributing a work to a known author. Another
example is given by analysis of blood splatter in simulated
crime scenes, in which the out-of-sample unit is the
actual blood splatter pattern from a crime scene. The
out-of-sample unit may be from the same time as the training
units, from a previous time, or from a future time.
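
The split between a training sample and out-of-sample units can be made concrete with a small sketch. The Python code below assumes one particular kind of predictive model, a one-variable logistic regression fitted by gradient descent; the text above does not prescribe any specific technique. The data, the attribute (late payments in the past year) and all names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(y = 1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b


def score(x, w, b):
    """Predictive score (a probability) for one unit with attribute x."""
    return sigmoid(w * x + b)


# Training sample: known attribute and known performance (1 = defaulted).
train_x = [0, 0, 1, 1, 2, 3, 4, 5]
train_y = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic(train_x, train_y)

# Out-of-sample units: known attribute, unknown performance.
out_of_sample = {"unit_a": 0, "unit_b": 2, "unit_c": 5}
scores = {name: score(x, w, b) for name, x in out_of_sample.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Nothing in the fitting step depends on when the out-of-sample units were observed, which matches the point above that they may come from the same time as the training units, an earlier time, or a later one.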


6.2.2 Descriptive models 

Descriptive models quantify relationships in data in a way 
that is often used to classify customers or prospects into 
groups. Unlike predictive models that focus on predicting 
a single customer behavior (such as credit risk), descrip¬ 
tive models identify many different relationships between 
customers or products. Descriptive models do not rank- 
order customers by their likelihood of taking a particular 
action the way predictive models do. Instead, descriptive 
models can be used, for example, to categorize customers 
by their product preferences and life stage. Descriptive 
modeling tools can be utilized to develop further models
that can simulate a large number of individualized agents
and make predictions.
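
One way to group customers without ranking them is clustering. The sketch below assumes a one-dimensional k-means procedure applied to annual spend; the text above does not name a specific technique, so this is only one possible illustration. The spend figures, the number of groups and the deterministic choice of starting centres are all hypothetical.

```python
def kmeans_1d(values, k, max_iter=100):
    """Group numbers into k clusters; returns (centres, labels)."""
    ordered = sorted(values)
    n = len(ordered)
    # Starting centres: k evenly spaced order statistics (repeatable choice).
    centres = [float(ordered[(2 * i + 1) * n // (2 * k)]) for i in range(k)]

    def nearest(v):
        # Index of the centre closest to v.
        return min(range(k), key=lambda j: abs(v - centres[j]))

    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[nearest(v)].append(v)
        # Move each centre to the mean of its group (keep it if the group is empty).
        updated = [sum(g) / len(g) if g else centres[j]
                   for j, g in enumerate(groups)]
        if updated == centres:
            break
        centres = updated

    labels = [nearest(v) for v in values]
    return centres, labels


# Hypothetical annual spend per customer.
spend = [120, 150, 130, 900, 950, 1000, 5000, 5200]
centres, labels = kmeans_1d(spend, k=3)
print(centres)
print(labels)
```

The output assigns each customer to a segment but says nothing about how likely any customer is to take a particular action, which is the distinction drawn above between descriptive and predictive models.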


6.2.3 Decision models 

Decision models describe the relationship between all the 
elements of a decision — the known data (including re¬ 
sults of predictive models), the decision, and the forecast 
results of the decision — in order to predict the results of 
decisions involving many variables. These models can be 
used in optimization, maximizing certain outcomes while 
minimizing others. Decision models are generally used to 
develop decision logic or a set of business rules that will 
produce the desired action for every customer or circum¬ 
stance. 
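
A very small decision model can be written as a rule that combines a predictive score with the payoffs of each possible outcome. The Python sketch below assumes a lending decision with two outcomes; the income and loss figures, the function names and the approve-or-decline rule are hypothetical and serve only to illustrate the idea.

```python
def expected_profit(p_default, income_if_repaid, loss_if_default):
    """Forecast result of approving, given the predicted default probability."""
    return (1 - p_default) * income_if_repaid - p_default * loss_if_default


def decide(p_default, income_if_repaid=200.0, loss_if_default=1000.0):
    """Business rule: approve only when the expected profit is positive."""
    profit = expected_profit(p_default, income_if_repaid, loss_if_default)
    return "approve" if profit > 0 else "decline"


# Predicted default probabilities for two hypothetical applicants.
print(decide(0.05))
print(decide(0.30))
```

With these assumed figures the rule approves the first applicant and declines the second; changing the payoffs moves the break-even probability, which is how such a model ties the decision logic to the outcomes being optimized.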


6.3 Applications 

Although predictive analytics can be put to use in many 
applications, we outline a few examples where predictive 
analytics has shown positive impact in recent years. 


6.3.1 Analytical customer relationship 
management (CRM) 

Analytical customer relationship management (CRM) is a
frequent commercial application of predictive analysis.
Methods of predictive analysis are applied to customer
data to pursue CRM objectives, which involve
constructing a holistic view of the customer no matter where
their information resides in the company or the department
involved. CRM uses predictive analysis in applications
for marketing campaigns, sales, and customer services, to
name a few. These tools are required in order for a
company to posture and focus its efforts effectively across
the breadth of its customer base. Companies must analyze
and understand the products that are in demand or have
the potential for high demand, predict customers’ buying
habits in order to promote relevant products at multiple
touch points, and proactively identify and mitigate issues
that have the potential to lose customers or reduce the
ability to gain new ones. Analytical customer relationship
management can be applied throughout the customer
lifecycle (acquisition, relationship growth, retention, and
win-back). Several of the application areas described
below (direct marketing, cross-sell, customer retention) are
part of customer relationship management.


6.3.2 Clinical decision support systems 

Experts use predictive analysis in health care primarily to 
determine which patients are at risk of developing certain 
conditions, like diabetes, asthma, heart disease, and other 
lifetime illnesses. Additionally, sophisticated clinical de¬ 
cision support systems incorporate predictive analytics to 
support medical decision making at the point of care. A 
working definition has been proposed by Robert Hay¬ 
ward of the Centre for Health Evidence: “Clinical Deci¬ 
sion Support Systems link health observations with health 
knowledge to influence health choices by clinicians for 
improved health care.” 


6.3.3 Collection analytics 

Many portfolios have a set of delinquent customers who
do not make their payments on time. The financial
institution has to undertake collection activities on these
customers to recover the amounts due. A lot of collection
resources are wasted on customers who are difficult or
impossible to recover. Predictive analytics can help
optimize the allocation of collection resources by identifying
the most effective collection agencies, contact strategies,
legal actions and other strategies for each customer, thus
significantly increasing recovery while reducing
collection costs.

6.3.4 Cross-sell

Often corporate organizations collect and maintain abun¬ 
dant data (e.g. customer records, sale transactions) as 
exploiting hidden relationships in the data can provide a 
competitive advantage. For an organization that offers 
multiple products, predictive analytics can help analyze 
customers’ spending, usage and other behavior, leading to 
efficient cross sales, or selling additional products to
current customers.[2] This directly leads to higher
profitability per customer and stronger customer relationships.

6.3.5 Customer retention 

With the number of competing services available,
businesses need to focus efforts on maintaining continuous
consumer satisfaction, rewarding consumer loyalty and
minimizing customer attrition. In addition, small
increases in customer retention have been shown to
increase profits disproportionately. One study concluded
that a 5% increase in customer retention rates will
increase profits by 25% to 95%.[14] Businesses tend to
respond to customer attrition on a reactive basis, acting only
after the customer has initiated the process to terminate
service. At this stage, changing the customer’s decision
is almost impossible. Proper application of predictive
analytics can lead to a more proactive retention strategy.
By frequent examination of a customer’s past service
usage, service performance, spending and other behavior
patterns, predictive models can determine the likelihood
of a customer terminating service sometime soon.[7] An
intervention with lucrative offers can increase the chance
of retaining the customer. Silent attrition, the behavior
of a customer to slowly but steadily reduce usage, is
another problem that many companies face. Predictive
analytics can also predict this behavior, so that the
company can take proper actions to increase customer
activity.

6.3.6 Direct marketing 

When marketing consumer products and services, there is 
the challenge of keeping up with competing products and 
consumer behavior. Apart from identifying prospects, 
predictive analytics can also help to identify the most ef¬ 
fective combination of product versions, marketing ma¬ 
terial, communication channels and timing that should be 
used to target a given consumer. The goal of predictive 
analytics is typically to lower the cost per order or cost 
per action. 

6.3.7 Fraud detection 

Fraud is a big problem for many businesses and can be of
various types: inaccurate credit applications, fraudulent
transactions (both offline and online), identity thefts and
false insurance claims. These problems plague firms of
all sizes in many industries. Some examples of likely
victims are credit card issuers, insurance companies,[15]
retail merchants, manufacturers, business-to-business
suppliers and even service providers. A predictive model
can help weed out the “bads” and reduce a business’s
exposure to fraud.

Predictive modeling can also be used to identify high-risk 
fraud candidates in business or the public sector. Mark 
Nigrini developed a risk-scoring method to identify audit 
targets. He describes the use of this approach to detect 
fraud in the franchisee sales reports of an international 
fast-food chain. Each location is scored using 10 
predictors. The 10 scores are then weighted to give one final 
overall risk score for each location. The same scoring 
approach was also used to identify high-risk check kiting 
accounts, potentially fraudulent travel agents, and 
questionable vendors. A reasonably complex model was used 
to identify fraudulent monthly reports submitted by 
divisional controllers.[16] 
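
The weighting step described above can be sketched as a 
weighted average. This is only an illustration: the source 
does not give Nigrini's predictors or weights, so the 
function name, the three example scores and the weights 
below are all hypothetical. 

```python
def overall_risk_score(scores, weights):
    """Combine per-predictor scores into one weighted-average risk score.

    Both arguments are hypothetical inputs; scores are assumed to lie in [0, 1].
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Illustrative example with 3 predictors instead of 10.
example_scores = [0.2, 0.9, 0.5]
example_weights = [1.0, 3.0, 1.0]
risk = overall_risk_score(example_scores, example_weights)
```

Locations would then be ranked by this score so that the 
highest-scoring ones are audited first. 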

The Internal Revenue Service (IRS) of the United States 
also uses predictive analytics to mine tax returns and 
identify tax fraud.[15] 

Recent advancements in technology have also introduced 
predictive behavior analysis for web fraud detection. This 
type of solution utilizes heuristics in order to study normal 
web user behavior and detect anomalies indicating fraud 
attempts. 

6.3.8 Portfolio, product or economy-level prediction 

Often the focus of analysis is not the consumer but the 
product, portfolio, firm, industry or even the economy. 
For example, a retailer might be interested in predicting 
store-level demand for inventory management purposes. 
Or the Federal Reserve Board might be interested in 
predicting the unemployment rate for the next year. These 
types of problems can be addressed by predictive 
analytics using time series techniques (see below). They can 
also be addressed via machine learning approaches which 
transform the original time series into a feature vector 
space, where the learning algorithm finds patterns that 
have predictive power.[17][18] 
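
One common way to turn a time series into feature vectors 
is to use the previous few observations as the features 
for predicting the next one. The sketch below assumes this 
"lag feature" construction; the function name and the 
demand figures are illustrative and not taken from the 
cited work. 

```python
def make_lag_features(series, n_lags):
    """Return (feature_vector, target) pairs built from the previous n_lags values."""
    rows = []
    for t in range(n_lags, len(series)):
        features = series[t - n_lags:t]  # the n_lags values just before time t
        target = series[t]               # the value to be predicted
        rows.append((features, target))
    return rows


# Hypothetical store-level demand for five consecutive periods.
demand = [10, 12, 13, 15, 18]
pairs = make_lag_features(demand, n_lags=2)
```

Each pair can then be passed to any regression or machine 
learning method that accepts fixed-length inputs. 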

6.3.9 Risk management 

When employing risk management techniques, the aim is 
always to predict and benefit from a future scenario. 
The capital asset pricing model (CAPM) “predicts” the 
best portfolio to maximize return. Probabilistic risk 
assessment (PRA), when combined with mini-Delphi 
techniques and statistical approaches, yields accurate 
forecasts, and RiskAoA is a stand-alone predictive 
tool.[19] These are three examples of approaches that 
can extend from project to market, and from near to 
long term. Underwriting (see below) and other business 
approaches identify risk management as a predictive 
method. 


6.3.10 Underwriting 

Many businesses have to account for risk exposure due 
to their different services and determine the cost needed 
to cover the risk. For example, auto insurance providers 
need to accurately determine the amount of premium to 
charge to cover each automobile and driver. A financial 
company needs to assess a borrower’s potential and abil¬ 
ity to pay before granting a loan. For a health insurance 
provider, predictive analytics can analyze a few years of 
past medical claims data, as well as lab, pharmacy and 
other records where available, to predict how expensive 
an enrollee is likely to be in the future. Predictive analyt¬ 
ics can help underwrite these quantities by predicting the 
chances of illness, default, bankruptcy, etc. Predictive 
analytics can streamline the process of customer 
acquisition by predicting the future risk behavior of a 
customer using application level data.[4] Predictive 
analytics in the 
form of credit scores have reduced the amount of time it 
takes for loan approvals, especially in the mortgage mar¬ 
ket where lending decisions are now made in a matter of 
hours rather than days or even weeks. Proper predictive 
analytics can lead to proper pricing decisions, which can 
help mitigate future risk of default. 

6.4 Technology and big data influences 

Big data is a collection of data sets that are so large 
and complex that they become awkward to work with 
using traditional database management tools. The vol¬ 
ume, variety and velocity of big data have introduced 
challenges across the board for capture, storage, search, 
sharing, analysis, and visualization. Examples of big data 
sources include web logs, RFID, sensor data, social 
networks, Internet search indexing, call detail records, 
military surveillance, and complex data in astronomic, 
biogeochemical, genomics, and atmospheric sciences. Big 
data is the core of most predictive analytic services 
offered by IT organizations.[20] Thanks to technological 
advances in computer hardware (faster CPUs, cheaper 
memory, and MPP architectures) and new technologies 
such as Hadoop, MapReduce, and in-database and 
text analytics for processing big data, it is now feasible to 
collect, analyze, and mine massive amounts of structured 
and unstructured data for new insights.[15] Today, 
exploring big data and using predictive analytics is within 
reach of more organizations than ever before, and new 
methods that are capable of handling such datasets have 
been proposed.[21][22] 


6.5 Analytical Techniques 

The approaches and techniques used to conduct predic¬ 
tive analytics can broadly be grouped into regression tech¬ 
niques and machine learning techniques. 

6.5.1 Regression techniques 

Regression models are the mainstay of predictive analyt¬ 
ics. The focus lies on establishing a mathematical equa¬ 
tion as a model to represent the interactions between the 
different variables in consideration. Depending on the 
situation, there are a wide variety of models that can be 
applied while performing predictive analytics. Some of 
them are briefly discussed below. 

Linear regression model 

The linear regression model analyzes the relationship be¬ 
tween the response or dependent variable and a set of in¬ 
dependent or predictor variables. This relationship is ex¬ 
pressed as an equation that predicts the response variable 
as a linear function of the parameters. These parameters 
are adjusted so that a measure of fit is optimized. Much 
of the effort in model fitting is focused on minimizing the 
size of the residual, as well as ensuring that it is randomly 
distributed with respect to the model predictions. 

The goal of regression is to select the parameters of the 
model so as to minimize the sum of the squared 
residuals. This is referred to as ordinary least squares (OLS) 
estimation and results in best linear unbiased estimates 
(BLUE) of the parameters when the Gauss-Markov 
assumptions are satisfied. 

Once the model has been estimated we would be 
interested to know if the predictor variables belong in the 
model, i.e. is the estimate of each variable’s contribution 
reliable? To do this we can check the statistical 
significance of the model’s coefficients, which can be 
measured using the t-statistic. This amounts to testing 
whether the coefficient is significantly different from zero. 
How well the model predicts the dependent variable based 
on the value of the independent variables can be assessed 
by using the R² statistic. It measures the predictive power 
of the model, i.e. the proportion of the total variation in 
the dependent variable that is “explained” (accounted for) 
by variation in the independent variables. 
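
To make these quantities concrete, the following sketch 
fits a regression with a single predictor by ordinary least 
squares and reports the slope, its t-statistic and R². It 
is a minimal illustration using only the standard library; 
the function name and the data are invented for the 
example, and a real analysis would normally rely on a 
statistics package. 

```python
import math


def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares; return fit statistics."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r ** 2 for r in residuals)       # sum of squared residuals
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)  # total variation in y

    r_squared = 1.0 - ss_res / ss_tot
    # Standard error of the slope, using n - 2 degrees of freedom.
    se_slope = math.sqrt(ss_res / (n - 2) / sxx)
    t_slope = slope / se_slope
    return {"slope": slope, "intercept": intercept,
            "r_squared": r_squared, "t_slope": t_slope}


# Made-up data lying close to the line y = 2x.
x_values = [1, 2, 3, 4, 5]
y_values = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = fit_simple_ols(x_values, y_values)
```

A large absolute t-statistic for the slope indicates that 
the coefficient differs significantly from zero, and an R² 
close to 1 indicates that most of the variation in y is 
accounted for by x. 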

Discrete choice models 

Multivariate regression (above) is generally used when 
the response variable is continuous and has an unbounded 
range. Often the response variable may not be continuous 
but rather discrete. While mathematically it is feasible to 
apply multivariate regression to discrete ordered 
dependent variables, some of the assumptions behind the theory 
of multivariate linear regression no longer hold, and there 
are other techniques such as discrete choice models which 
are better suited for this type of analysis. If the dependent 
variable is discrete, some of those superior methods are 
logistic regression, multinomial logit and probit models. 
Logistic regression and probit models are used when the 
dependent variable is binary. 

Logistic regression 

For more details on this topic, see logistic regression. 

In a classification setting, assigning outcome probabilities 
to observations can be achieved through the use of a 
logistic model, which transforms the probability of the 
binary dependent variable into an unbounded continuous 
variable (the log-odds) and estimates a regular 
multivariate model for it (see Allison’s Logistic Regression 
for more information on the theory of logistic regression). 

The Wald and likelihood-ratio tests are used to test the 
statistical significance of each coefficient b in the model 
(analogous to the t tests used in OLS regression; see 
above). A measure for assessing the goodness-of-fit of a 
classification model is the “percentage correctly predicted”. 
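
The sketch below illustrates the two ideas in this section: 
the logit transform and its inverse, and the “percentage 
correctly predicted” measure. The coefficients b0 and b1 
are hypothetical values chosen for the example rather than 
estimates from real data, and the 0.5 cut-off is likewise 
an assumption. 

```python
import math


def logit(p):
    """Map a probability in (0, 1) to the unbounded log-odds scale."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Inverse of logit: map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def percent_correctly_predicted(probabilities, labels, threshold=0.5):
    """Share of observations whose thresholded prediction matches the label."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(1 for pred, y in zip(predictions, labels) if pred == y)
    return 100.0 * correct / len(labels)


# Hypothetical fitted coefficients: log-odds = b0 + b1 * x.
b0, b1 = -3.0, 1.5
xs = [0, 1, 2, 3, 4]
observed = [0, 0, 0, 1, 1]
probs = [sigmoid(b0 + b1 * x) for x in xs]
pcp = percent_correctly_predicted(probs, observed)
```

Here the observation at x = 2 has a predicted probability 
of exactly 0.5 and is therefore classified as 1, which is 
the one misclassification in the example. 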

Multinomial logistic regression 

An extension of the binary logit model to cases where 
the dependent variable has more than 2 categories is the 
multinomial logit model. In such cases collapsing the data 
into two categories might not make good sense or may 
lead to loss in the richness of the data. The multinomial 
logit model is the appropriate technique in these cases, 
especially when the dependent variable categories are not 
ordered (for example, colors like red, blue, green). Some 
authors have extended multinomial regression to include 
feature selection/importance methods such as random 
multinomial logit. 
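
In a multinomial logit model each category receives a 
linear score, and the scores are converted into 
probabilities that sum to one. A minimal sketch of that 
conversion (the softmax function) follows; the three 
scores are made-up numbers standing in for the linear 
predictors of the categories red, blue and green. 

```python
import math


def softmax(scores):
    """Convert one score per category into probabilities that sum to 1."""
    largest = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


categories = ["red", "blue", "green"]
category_scores = [1.0, 2.0, 0.5]  # hypothetical linear predictors
category_probs = softmax(category_scores)
most_likely = categories[category_probs.index(max(category_probs))]
```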

Probit regression 

Probit models offer an alternative to logistic regres¬ 
sion for modeling categorical dependent variables. Even 
though the outcomes tend to be similar, the underlying 
distributions are different. Probit models are popular in 
social sciences like economics. 

A good way to understand the key difference between 
probit and logit models is to assume that there is a latent 
variable z. 

We do not observe z but instead observe y, which takes the 
value 0 or 1. In the logit model we assume that the error 
term of z follows a logistic distribution. In the probit 
model we assume that it follows a standard normal 
distribution. Note that in social sciences (e.g. economics), 
probit is often used to model situations where the observed 
variable y is continuous but takes values between 0 and 1. 

Logit versus probit 

The probit model has been around longer than the logit 
model. They behave similarly, except that the logistic 
distribution tends to have slightly heavier tails. One of the 
reasons the logit model was formulated was that the probit 
model was computationally difficult due to the 
requirement of numerically calculating integrals. Modern 
computing however has made this computation fairly simple. 
The coefficients obtained from the logit and probit model 
are fairly close. However, the odds ratio is easier to 
interpret in the logit model. 
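
The similarity of the two link functions, and the 
difference in their tails, can be checked numerically. The 
sketch below compares the standard normal CDF with a 
logistic CDF rescaled to unit variance (the standard 
logistic distribution has standard deviation π/√3); the 
function names are illustrative. 

```python
import math

# Standard deviation of the standard logistic distribution.
LOGISTIC_SD = math.pi / math.sqrt(3.0)


def normal_cdf(z):
    """CDF of the standard normal distribution (the probit link)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def unit_variance_logistic_cdf(z):
    """CDF of a logistic distribution rescaled to have variance 1."""
    return 1.0 / (1.0 + math.exp(-z * LOGISTIC_SD))


# Near the centre the two curves are close to each other.
centre_gap = abs(normal_cdf(0.5) - unit_variance_logistic_cdf(0.5))

# Far in the tail the logistic curve carries noticeably more probability.
normal_tail = normal_cdf(-4.0)
logistic_tail = unit_variance_logistic_cdf(-4.0)
```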

Practical reasons for choosing the probit model over the 
logistic model would be: 

• There is a strong belief that the underlying distribu¬ 
tion is normal 

• The actual event is not a binary outcome (e.g., 
bankruptcy status) but a proportion (e.g., proportion 
of population at different debt levels). 

Time series models 

Time series models are used for predicting or forecasting 
the future behavior of variables. These models account 
for the fact that data points taken over time may have an 
internal structure (such as autocorrelation, trend or sea¬ 
sonal variation) that should be accounted for. As a result 
standard regression techniques cannot be applied to time 
series data and methodology has been developed to de¬ 
compose the trend, seasonal and cyclical component of 
the series. Modeling the dynamic path of a variable can 
improve forecasts since the predictable component of the 
series can be projected into the future. 

Time series models estimate difference equations 
containing stochastic components. Two commonly used 
forms of these models are autoregressive models (AR) 
and moving average (MA) models. The Box-Jenkins 
methodology (1976) developed by George Box and G.M. 
Jenkins combines the AR and MA models to produce 
the ARMA (autoregressive moving average) model, which 
is the cornerstone of stationary time series analysis. 
ARIMA (autoregressive integrated moving average) 
models, on the other hand, are used to describe 
non-stationary time series. Box and Jenkins suggest 
differencing a non-stationary time series to obtain a 
stationary series to which an ARMA model can be applied. 
Non-stationary time series have a pronounced trend and do 
not have a constant long-run mean or variance. 
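
Differencing and a first-order autoregressive fit can be 
sketched in a few lines. The example assumes a simple 
least-squares estimate of the AR(1) coefficient with no 
intercept; the series are artificial and the function 
names are illustrative. 

```python
def difference(series, lag=1):
    """Return the differenced series x[t] - x[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def estimate_ar1(series):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + noise."""
    numerator = sum(series[t] * series[t - 1] for t in range(1, len(series)))
    denominator = sum(series[t - 1] ** 2 for t in range(1, len(series)))
    return numerator / denominator


# A series with a linear trend is non-stationary; its first difference is constant.
trending = [3 + 2 * t for t in range(6)]
differenced = difference(trending)

# A noise-free AR(1) series with phi = 0.5 recovers the coefficient exactly.
ar_series = [8.0, 4.0, 2.0, 1.0, 0.5]
phi_hat = estimate_ar1(ar_series)
```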

Box and Jenkins proposed a three stage methodology 
which includes: model identification, estimation and 
validation. The identification stage involves identifying if 
the series is stationary or not and the presence of 
seasonality by examining plots of the series, autocorrelation 
and partial autocorrelation functions. In the estimation 
stage, models are estimated using non-linear time series 
or maximum likelihood estimation procedures. Finally 
the validation stage involves diagnostic checking such as 
plotting the residuals to detect outliers and evidence of 
model fit. 

In recent years time series models have become 
more sophisticated and attempt to model conditional 
heteroskedasticity with models such as ARCH 
(autoregressive conditional heteroskedasticity) and 
GARCH (generalized autoregressive conditional 
heteroskedasticity), which are frequently used for 
financial time series. In addition, time series models are 
also used to understand inter-relationships among economic 
variables represented by systems of equations using VAR 
(vector autoregression) and structural VAR models. 


Survival or duration analysis 

Survival analysis is another name for time to event anal¬ 
ysis. These techniques were primarily developed in the 
medical and biological sciences, but they are also widely 
used in the social sciences like economics, as well as in 
engineering (reliability and failure time analysis). 

Censoring and non-normality, which are characteristic of 
survival data, generate difficulty when trying to analyze 
the data using conventional statistical models such as mul¬ 
tiple linear regression. The normal distribution, being a 
symmetric distribution, takes positive as well as negative 
values, but duration by its very nature cannot be negative 
and therefore normality cannot be assumed when dealing 
with duration/survival data. Hence the normality assump¬ 
tion of regression models is violated. 

The assumption is that if the data were not censored it 
would be representative of the population of interest. In 
survival analysis, censored observations arise whenever 
the dependent variable of interest represents the time to 
a terminal event, and the duration of the study is limited 
in time. 

An important concept in survival analysis is the hazard 
rate, defined as the probability that the event will occur 
at time t conditional on surviving until time t. Another 
concept related to the hazard rate is the survival function 
which can be defined as the probability of surviving to 
time t. 

Most models try to model the hazard rate by choosing 
the underlying distribution depending on the shape of the 
hazard function. A distribution whose hazard function 
slopes upward is said to have positive duration 
dependence, a decreasing hazard shows negative duration 
dependence, whereas constant hazard is a process with no 
memory, usually characterized by the exponential 
distribution. Some of the distributional choices in survival 
models are: F, gamma, Weibull, log normal, inverse 
normal, exponential etc. All these distributions are for a 
non-negative random variable. 

Duration models can be parametric, non-parametric or 
semi-parametric. Some of the models commonly used 
are the Kaplan-Meier estimator (non-parametric) and the 
Cox proportional hazards model (semi-parametric). 
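
As an illustration of how censored observations enter a 
non-parametric estimate, the sketch below computes a 
Kaplan-Meier survival curve. At each event time the 
current survival probability is multiplied by the fraction 
of subjects still at risk who did not experience the 
event; censored subjects simply leave the risk set. The 
function name and the five durations are invented for the 
example. 

```python
def kaplan_meier(durations, event_observed):
    """Return [(time, survival_probability)] at each time where an event occurred."""
    data = sorted(zip(durations, event_observed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        time = data[i][0]
        events = 0
        leaving = 0
        # Group all subjects that share this time.
        while i < len(data) and data[i][0] == time:
            if data[i][1]:
                events += 1
            leaving += 1
            i += 1
        if events > 0:
            survival *= 1.0 - events / at_risk
            curve.append((time, survival))
        at_risk -= leaving  # both events and censored subjects leave the risk set
    return curve


# True means the event was observed; False means the observation was censored.
example_durations = [2, 3, 3, 5, 8]
example_observed = [True, True, False, True, False]
km_curve = kaplan_meier(example_durations, example_observed)
```

In this example the estimated survival probability drops 
to 0.8 at time 2, 0.6 at time 3 and 0.3 at time 5; the 
censored observation at time 8 does not change the curve. 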

Classification and regression trees 

Main article: decision tree learning 

Globally-optimal classification tree analysis (GO-CTA) 
(also called hierarchical optimal discriminant analysis, 
HODA) is a generalization of optimal discriminant analysis 
that may be used to identify the statistical model that has 
maximum accuracy for predicting the value of a categorical 
dependent variable for a dataset consisting of categorical 
and continuous variables. The output of HODA is a 
non-orthogonal tree that combines categorical variables 
and cut points for continuous variables that yields 
maximum predictive accuracy, an assessment of the exact 
Type I error rate, and an evaluation of potential 
cross-generalizability of the statistical model. Hierarchical 
optimal discriminant analysis may be thought of as a 
generalization of Fisher’s linear discriminant analysis. 
Optimal discriminant analysis is an alternative to ANOVA 
(analysis of variance) and regression analysis, which 
attempt to express one dependent variable as a linear 
combination of other features or measurements. However, 
ANOVA and regression analysis give a dependent variable 
that is a numerical variable, while hierarchical optimal 
discriminant analysis gives a dependent variable that is a 
class variable. 

Classification and regression trees (CART) are a non- 
parametric decision tree learning technique that produces 
either classification or regression trees, depending on 
whether the dependent variable is categorical or numeric, 
respectively. 

Decision trees are formed by a collection of rules based 
on variables in the modeling data set: 

• Rules based on variables’ values are selected to get 
the best split to differentiate observations based on 
the dependent variable 

• Once a rule is selected and splits a node into two, the 
same process is applied to each “child” node (i.e. it 
is a recursive procedure) 

• Splitting stops when CART detects no further gain 
can be made, or some pre-set stopping rules are met. 
(Alternatively, the data are split as much as possible 
and then the tree is later pruned.) 
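
The first step in the list above, choosing the best split, 
can be sketched for a single numeric variable. The text 
does not say which criterion is used, so the example 
assumes Gini impurity, a common choice for classification 
trees; the function names and data are illustrative. 

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best rule 'value <= threshold'.

    Returns (None, impurity_of_node) if no split improves on the unsplit node.
    """
    n = len(values)
    pairs = list(zip(values, labels))
    best = (None, gini(labels))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


feature_values = [1, 2, 3, 10, 11, 12]
class_labels = ["a", "a", "a", "b", "b", "b"]
split = best_split(feature_values, class_labels)
```

A full tree would apply the same search recursively to the 
observations on each side of the chosen threshold, over 
all available variables. 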

Each branch of the tree ends in a terminal node. Each 
observation falls into exactly one terminal node, 
and each terminal node is uniquely defined by a set of 
rules. 

A very popular method for predictive analytics is Leo 
Breiman’s Random forests or derived versions of this 
technique like Random multinomial logit. 

Multivariate adaptive regression splines 

Multivariate adaptive regression splines (MARS) is a non- 
parametric technique that builds flexible models by fitting 
piecewise linear regressions. 

An important concept associated with regression splines 
is that of a knot. A knot is where one local regression 
model gives way to another and thus is the point of 
intersection between two splines. 

In multivariate adaptive regression splines, basis 
functions are the tool used for generalizing the search for 
knots. Basis functions are a set of functions used to 
represent the information contained in one or more 
variables. The MARS model almost always creates the basis 
functions in pairs. 
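
A pair of basis functions around a knot is usually written 
as two mirrored “hinge” functions, max(0, x − knot) and 
max(0, knot − x). The sketch below shows such a pair and a 
piecewise linear model built from it; the knot location 
and coefficients are hypothetical values, not fitted ones. 

```python
def hinge_pair(x, knot):
    """Return the mirrored pair of hinge basis functions evaluated at x."""
    return max(0.0, x - knot), max(0.0, knot - x)


def piecewise_linear(x, knot, intercept, right_slope, left_coef):
    """A one-knot model: intercept + right_slope*(x-knot)+ + left_coef*(knot-x)+."""
    right, left = hinge_pair(x, knot)
    return intercept + right_slope * right + left_coef * left


# Hypothetical model with a knot at x = 3.
KNOT = 3.0
at_knot = piecewise_linear(3.0, KNOT, intercept=10.0, right_slope=2.0, left_coef=0.5)
above_knot = piecewise_linear(5.0, KNOT, intercept=10.0, right_slope=2.0, left_coef=0.5)
below_knot = piecewise_linear(1.0, KNOT, intercept=10.0, right_slope=2.0, left_coef=0.5)
```

Only one member of the pair is non-zero on each side of 
the knot, which is what lets the fitted line change slope 
there. 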

The MARS approach deliberately overfits the model and 
then prunes to get to the optimal model. The algorithm is 
computationally very intensive, and in practice we are 
required to specify an upper limit on the number of basis 
functions. 

6.5.2 Machine learning techniques 

Machine learning, a branch of artificial intelligence, was 
originally employed to develop techniques to enable com¬ 
puters to learn. Today, since it includes a number of 
advanced statistical methods for regression and classifi¬ 
cation, it finds application in a wide variety of fields in¬ 
cluding medical diagnostics, credit card fraud detection, 
face and speech recognition and analysis of the stock mar¬ 
ket. In certain applications it is sufficient to directly pre¬ 
dict the dependent variable without focusing on the un¬ 
derlying relationships between variables. In other cases, 
the underlying relationships can be very complex and the 
mathematical form of the dependencies unknown. For 
such cases, machine learning techniques emulate human 
cognition and learn from training examples to predict fu¬ 
ture events. 

A brief discussion of some of these methods used com¬ 
monly for predictive analytics is provided below. A de¬ 
tailed study of machine learning can be found in Mitchell 
(1997). 

Neural networks 

Neural networks are nonlinear sophisticated modeling 
techniques that are able to model complex functions. 
They can be applied to problems of prediction, 
classification or control in a wide spectrum of fields such 
as finance, cognitive psychology/neuroscience, medicine, 
engineering, and physics. 

Neural networks are used when the exact nature of the 
relationship between inputs and output is not known. A key 
feature of neural networks is that they learn the 
relationship between inputs and output through training. 
There are three types of training used by different 
networks: supervised training, unsupervised training and 
reinforcement learning, with supervised being the most 
common one. 

Some examples of neural network training techniques 
are backpropagation, quick propagation, conjugate 
gradient descent, projection operator, Delta-Bar-Delta etc. 
Some network architectures are multilayer perceptrons 
(usually trained with supervision), Kohonen networks 
(unsupervised), Hopfield networks, etc. 

Multilayer Perceptron (MLP) 

The multilayer perceptron (MLP) consists of an input 
and an output layer with one or more hidden layers of 
nonlinearly-activating nodes, typically sigmoid nodes. The 
network’s output is determined by its weight vector, so 
training consists of adjusting the weights of the network. 
Backpropagation employs gradient descent to minimize the 
squared error between the network output values and the 
desired values for those outputs. The weights are adjusted 
by an iterative process of repeatedly presenting the 
training examples. Making these small changes in the 
weights to approach the desired values is the process 
called training the net, and it is carried out using the 
training set (learning rule). 
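
The training loop described above can be sketched for a 
very small network: two inputs, two sigmoid hidden nodes 
and one sigmoid output, trained by backpropagation on the 
logical OR function. The starting weights, learning rate 
and number of passes are arbitrary choices made for this 
illustration. 

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Each weight row is [bias, w1, w2, ...]. Returns (hidden activations, output)."""
    hidden = [sigmoid(row[0] + sum(w * xi for w, xi in zip(row[1:], x)))
              for row in w_hidden]
    output = sigmoid(w_out[0] + sum(w * h for w, h in zip(w_out[1:], hidden)))
    return hidden, output


def train_step(x, target, w_hidden, w_out, learning_rate):
    """One gradient-descent update on the squared error for a single example."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at the output node for the loss 0.5 * (output - target)^2.
    delta_out = (output - target) * output * (1.0 - output)
    # Propagate the error back to each hidden node (using the pre-update weights).
    delta_hidden = [delta_out * w_out[j + 1] * h * (1.0 - h)
                    for j, h in enumerate(hidden)]
    w_out[0] -= learning_rate * delta_out
    for j, h in enumerate(hidden):
        w_out[j + 1] -= learning_rate * delta_out * h
    for j, row in enumerate(w_hidden):
        row[0] -= learning_rate * delta_hidden[j]
        for i, xi in enumerate(x):
            row[i + 1] -= learning_rate * delta_hidden[j] * xi


def total_squared_error(examples, w_hidden, w_out):
    return sum((forward(x, w_hidden, w_out)[1] - t) ** 2 for x, t in examples)


# Training set for logical OR.
examples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
            ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
w_hidden = [[0.1, 0.2, -0.1], [-0.1, 0.1, 0.3]]  # arbitrary starting weights
w_out = [0.0, 0.2, -0.2]

error_before = total_squared_error(examples, w_hidden, w_out)
for _ in range(2000):  # repeated presentation of the training set
    for x, t in examples:
        train_step(x, t, w_hidden, w_out, learning_rate=0.5)
error_after = total_squared_error(examples, w_hidden, w_out)
```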

Radial basis functions 

A radial basis function (RBF) is a function which has built 
into it a distance criterion with respect to a center. Such 
functions can be used very efficiently for interpolation 
and for smoothing of data. Radial basis functions have 
been applied in the area of neural networks where they 
are used as a replacement for the sigmoidal transfer func¬ 
tion. Such networks have 3 layers, the input layer, the 
hidden layer with the RBF non-linearity and a linear out¬ 
put layer. The most popular choice for the non-linearity 
is the Gaussian. RBF networks have the advantage of not 
being locked into local minima as do the feed-forward 
networks such as the multilayer perceptron. 
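
The three-layer structure can be sketched directly: a 
hidden layer of Gaussian functions, each measuring 
distance to its own center, followed by a linear output. 
The centers, width and output weights below are 
hypothetical; in practice they would be chosen or fitted 
from data. 

```python
import math


def gaussian_rbf(x, center, width):
    """Gaussian radial basis function: depends only on the distance to the center."""
    distance_sq = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-distance_sq / (2.0 * width ** 2))


def rbf_network(x, centers, width, weights, bias=0.0):
    """Linear output layer applied to the Gaussian hidden-layer activations."""
    hidden = [gaussian_rbf(x, center, width) for center in centers]
    return bias + sum(w * h for w, h in zip(weights, hidden))


example_centers = [(0.0, 0.0), (1.0, 1.0)]
example_weights = [1.0, -1.0]
network_output = rbf_network((0.0, 0.0), example_centers, 1.0, example_weights)
```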

Support vector machines 

Support vector machines (SVM) are used to detect and 
exploit complex patterns in data by clustering, classifying 
and ranking the data. They are learning machines that are 
used to perform binary classifications and regression 
estimations. They commonly use kernel-based methods to 
apply linear classification techniques to non-linear 
classification problems. There are a number of types of 
SVM, distinguished by their kernel, such as linear, 
polynomial, sigmoid etc. 
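
The kernels named above are simple functions of two input 
vectors. The sketch below defines them and checks the idea 
behind kernel methods: a degree-2 polynomial kernel gives 
the same value as an ordinary dot product taken after an 
explicit non-linear feature mapping. The parameter names 
and defaults are illustrative and are not those of any 
particular library. 

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, degree=2, coef0=1.0):
    return (dot(u, v) + coef0) ** degree


def sigmoid_kernel(u, v, gamma=0.1, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


def quadratic_feature_map(u):
    """Explicit features whose dot product equals (u . v)^2 for 2-D inputs."""
    x1, x2 = u
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


u, v = (1.0, 2.0), (3.0, 4.0)
kernel_value = polynomial_kernel(u, v, degree=2, coef0=0.0)
explicit_value = dot(quadratic_feature_map(u), quadratic_feature_map(v))
```

Because the kernel yields the same number without ever 
forming the mapped features, a linear method can be 
applied in the mapped space at little extra cost. 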


Naive Bayes 

Naive Bayes, based on Bayes’ conditional probability rule, 
is used for performing classification tasks. Naive Bayes 
assumes the predictors are statistically independent, which 
makes it an effective classification tool that is easy to 
interpret. It is best employed when faced with the problem 
of the ‘curse of dimensionality’, i.e. when the number of 
predictors is very high. 
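
The independence assumption means the probability of a 
class given several predictors is proportional to the 
class prior multiplied by one term per predictor. The 
following sketch implements this for categorical 
predictors; the add-one (Laplace) smoothing and the tiny 
data set are assumptions made for the example. 

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class value frequencies for each predictor."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, predictor index) -> value counts
    seen_values = defaultdict(set)       # predictor index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            seen_values[i].add(value)
    return class_counts, value_counts, seen_values


def predict_naive_bayes(model, row):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_counts, value_counts, seen_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(row):
            # Add-one smoothing keeps unseen values from giving probability zero.
            numerator = value_counts[(label, i)][value] + 1
            denominator = count + len(seen_values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical messages described by two categorical predictors.
training_rows = [("yes", "high"), ("yes", "low"), ("no", "low"), ("no", "low")]
training_labels = ["spam", "spam", "ham", "ham"]
nb_model = train_naive_bayes(training_rows, training_labels)
first_prediction = predict_naive_bayes(nb_model, ("yes", "high"))
second_prediction = predict_naive_bayes(nb_model, ("no", "low"))
```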


k-nearest neighbours 

The nearest neighbour algorithm (kNN) belongs to the 
class of pattern recognition statistical methods. The 
method does not impose a priori any assumptions about 
the distribution from which the modeling sample is 
drawn. It involves a training set with both positive and 
negative values. A new sample is classified by 
calculating the distance to the nearest neighbouring 
training case. The sign of that point will determine the 
classification of the sample. In the k-nearest neighbour 
classifier, the k nearest points are considered and the sign 
of the majority is used to classify the sample. The 
performance of the kNN algorithm is influenced by three 
main factors: (1) the distance measure used to locate the 
nearest neighbours; (2) the decision rule used to derive a 
classification from the k-nearest neighbours; and (3) the 
number of neighbours used to classify the new sample. It 
can be proved that this method is universally 
asymptotically convergent, i.e. as the size of the 
training set increases (with k growing suitably with it), if 
the observations are independent and identically 
distributed (i.i.d.), regardless of the distribution from 
which the sample is drawn, the predicted class will 
converge to the class assignment that minimizes 
misclassification error. See Devroye et al. 
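
A minimal k-nearest neighbour classifier makes the three 
factors listed above explicit. The sketch assumes 
Euclidean distance as the distance measure and a simple 
majority vote as the decision rule; the training points 
are invented for the example. 

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Factor 1: the distance measure (Euclidean here).
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Factor 3: the number of neighbours considered.
    nearest_labels = [label for _, label in by_distance[:k]]
    # Factor 2: the decision rule (majority vote here).
    return Counter(nearest_labels).most_common(1)[0][0]


# Training set with negative (-1) and positive (+1) cases.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
signs = [-1, -1, -1, 1, 1, 1]
near_origin = knn_classify(points, signs, (0.5, 0.5), k=3)
near_far_cluster = knn_classify(points, signs, (5.5, 5.5), k=3)
```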


Geospatial predictive modeling 

Conceptually, geospatial predictive modeling is rooted in 
the principle that the occurrences of events being mod¬ 
eled are limited in distribution. Occurrences of events 
are neither uniform nor random in distribution - there are 
spatial environment factors (infrastructure, sociocultural, 
topographic, etc.) that constrain and influence where the 
locations of events occur. Geospatial predictive modeling 
attempts to describe those constraints and influences by 
spatially correlating occurrences of historical geospatial 
locations with environmental factors that represent those 
constraints and influences. Geospatial predictive model¬ 
ing is a process for analyzing events through a geographic 
filter in order to make statements of likelihood for event 
occurrence or emergence. 


6.6 Tools 

Historically, using predictive analytics tools—as well as 
understanding the results they delivered—required ad¬ 
vanced skills. However, modern predictive analytics tools 
are no longer restricted to IT specialists. As more orga¬ 
nizations adopt predictive analytics into decision-making 
processes and integrate it into their operations, they are 
creating a shift in the market toward business users as the 
primary consumers of the information. Business users 
want tools they can use on their own. Vendors are 
responding by creating new software that removes the 
mathematical complexity, provides user-friendly graphic 
interfaces and/or builds in short cuts that can, for example, 
recognize the kind of data available and suggest an 
appropriate predictive model.[23] Predictive analytics tools 
have become sophisticated enough to adequately present and 
dissect data problems, so that any data-savvy 
information worker can utilize them to analyze data and 
retrieve meaningful, useful results.[2] For example, modern 
tools present findings using simple charts, graphs, and 
scores that indicate the likelihood of possible outcomes.[24] 

There are numerous tools available in the marketplace 
that help with the execution of predictive analytics. These 
range from those that need very little user sophistication 
to those that are designed for the expert practitioner. The 
difference between these tools is often in the level of cus¬ 
tomization and heavy data lifting allowed. 

Notable open source predictive analytic tools include: 

• scikit-learn 

• KNIME 

• OpenNN 

• Orange 

• R 

• Weka 

• GNU Octave 

• Apache Mahout 

Notable commercial predictive analytic tools include: 

• Alpine Data Labs 

• BIRT Analytics 

• Angoss KnowledgeSTUDIO 

• IBM SPSS Statistics and IBM SPSS Modeler 

• KXEN Modeler 

• Mathematica 

• MATLAB 

• Minitab 

• Neural Designer 

• Oracle Data Mining (ODM) 

• Pervasive 

• Predixion Software 

• RapidMiner 

• RCASE 

• Revolution Analytics 

• SAP 

• SAS and SAS Enterprise Miner 

• STATA 

• STATISTICA 

• TIBCO 

The most popular commercial predictive analytics 
software packages according to the Rexer Analytics 
Survey for 2013 are IBM SPSS Modeler, SAS Enterprise 
Miner, and Dell Statistica 
<http://www.rexeranalytics.com/Data-Miner-Survey-2013-Intro.html> 

6.6.1 PMML 

In an attempt to provide a standard language for express¬ 
ing predictive models, the Predictive Model Markup Lan¬ 
guage (PMML) has been proposed. Such an XML-based 
language provides a way for the different tools to de¬ 
fine predictive models and to share these between PMML 
compliant applications. PMML 4.0 was released in June, 
2009. 


6.7 Criticism 

There are plenty of skeptics when it comes to the ability 
of computers and algorithms to predict the future, 
including Gary King, a professor from Harvard University and 
the director of the Institute for Quantitative Social 
Science.[25] People are influenced by their environment in 
innumerable ways. Trying to understand what people will 
do next assumes that all the influential variables can be 
known and measured accurately. “People’s environments 
change even more quickly than they themselves do. Ev¬ 
erything from the weather to their relationship with their 
mother can change the way people think and act. All of 
those variables are unpredictable. How they will impact 
a person is even less predictable. If put in the exact same 
situation tomorrow, they may make a completely differ¬ 
ent decision. This means that a statistical prediction is 
only valid in sterile laboratory conditions, which suddenly 
isn't as useful as it seemed before.”[26] 


6.8 See also 

• Criminal Reduction Utilising Statistical History 

• Data mining 

• Learning analytics 

• Odds algorithm 

• Pattern recognition 

• Prescriptive analytics 

• Predictive modeling 

• RiskAoA, a predictive tool for discriminating future 
decisions 

Chapter 7 

Business intelligence 


Business intelligence (BI) is the set of techniques and 
tools for the transformation of raw data into meaningful 
and useful information for business analysis purposes. BI 
technologies are capable of handling large amounts of un¬ 
structured data to help identify, develop and otherwise 
create new strategic business opportunities. The goal of 
BI is to allow for the easy interpretation of these large 
volumes of data. Identifying new opportunities and 
implementing an effective strategy based on insights can 
provide businesses with a competitive market advantage and 
long-term stability.[1] 

BI technologies provide historical, current and predictive 
views of business operations. Common functions of busi¬ 
ness intelligence technologies are reporting, online ana¬ 
lytical processing, analytics, data mining, process mining, 
complex event processing, business performance man¬ 
agement, benchmarking, text mining, predictive analytics 
and prescriptive analytics. 

BI can be used to support a wide range of business de¬ 
cisions ranging from operational to strategic. Basic op¬ 
erating decisions include product positioning or pricing. 
Strategic business decisions include priorities, goals and 
directions at the broadest level. In all cases, BI is most ef¬ 
fective when it combines data derived from the market in 
which a company operates (external data) with data from 
company sources internal to the business such as financial 
and operations data (internal data). When combined, 
external and internal data can provide a more complete 
picture which, in effect, creates an “intelligence” that 
cannot be derived by any singular set of data.[2] 

7.1 Components 

Business intelligence is made up of an increasing number 
of components including: 

• Multidimensional aggregation and allocation 

• Denormalization, tagging and standardization 

• Realtime reporting with analytical alert 

• A method of interfacing with unstructured data 
sources 


• Group consolidation, budgeting and rolling forecasts 

• Statistical inference and probabilistic simulation 

• Key performance indicators optimization 

• Version control and process management 

• Open item management 


7.2 History 

The term “Business Intelligence” was originally coined by 
Richard Millar Devens in the ‘Cyclopaedia of 
Commercial and Business Anecdotes’ from 1865. Devens used 
the term to describe how the banker, Sir Henry 
Furnese, gained profit by receiving and acting upon 
information about his environment, prior to his competitors. 
“Throughout Holland, Flanders, France, and Germany, 
he maintained a complete and perfect train of business 
intelligence. The news of the many battles fought was thus 
received first by him, and the fall of Namur added to his 
profits, owing to his early receipt of the news.” (Devens, 
1865, p. 210). The ability to collect and react 
accordingly based on the information retrieved, an ability 
that Furnese excelled in, is today still at the very heart 
of BI.[3] 

In a 1958 article, IBM researcher Hans Peter Luhn used
the term business intelligence. He employed the Webster’s
dictionary definition of intelligence: “the ability
to apprehend the interrelationships of presented facts in
such a way as to guide action towards a desired goal.”[4]

Business intelligence as it is understood today is said to 
have evolved from the decision support systems (DSS) 
that began in the 1960s and developed throughout the 
mid-1980s. DSS originated in the computer-aided models
created to assist with decision making and planning.
From DSS, data warehouses, Executive Information
Systems, OLAP and business intelligence came into focus
beginning in the late 1980s.

In 1988, an Italian-Dutch-French-English consortium
organized an international meeting on Multiway Data
Analysis in Rome.[5] The ultimate goal of such analysis is
to reduce the multiple dimensions down to one or two (by
detecting the patterns within the data) that can then be
presented to human decision-makers.

In 1989, Howard Dresner (later a Gartner Group analyst)
proposed “business intelligence” as an umbrella
term to describe “concepts and methods to improve
business decision making by using fact-based support
systems.”[6] It was not until the late 1990s that this
usage was widespread.[7]


7.3 Data warehousing 

Often BI applications use data gathered from a data
warehouse (DW) or from a data mart, and the concepts of
BI and DW sometimes combine as "BI/DW"[8] or as
"BIDW". A data warehouse contains a copy of analytical
data that facilitates decision support. However, not
all data warehouses serve for business intelligence, nor do
all business intelligence applications require a data
warehouse.

To distinguish between the concepts of business intelli¬ 
gence and data warehouses, Forrester Research defines 
business intelligence in one of two ways: 

1. Using a broad definition: “Business Intelligence
is a set of methodologies, processes, architectures,
and technologies that transform raw data into
meaningful and useful information used to enable
more effective strategic, tactical, and operational
insights and decision-making.”[9] Under this definition,
business intelligence also includes technologies
such as data integration, data quality, data warehousing,
master-data management, text- and content-analytics,
and many others that the market sometimes lumps into
the "Information Management" segment. Therefore,
Forrester refers to data preparation and data usage as
two separate but closely linked segments of the
business-intelligence architectural stack.

2. Forrester defines the narrower business-intelligence
market as “...referring to just the top layers of the BI
architectural stack such as reporting, analytics and
dashboards.”[10]

7.4 Comparison with competitive intelligence

Though the term business intelligence is sometimes a
synonym for competitive intelligence (because they both
support decision making), BI uses technologies, processes,
and applications to analyze mostly internal, structured
data and business processes, while competitive
intelligence gathers, analyzes and disseminates information
with a topical focus on company competitors. If understood
broadly, business intelligence can include the subset
of competitive intelligence.[11]

7.5 Comparison with business analytics

Business intelligence and business analytics are sometimes
used interchangeably, but there are alternate
definitions.[12] One definition contrasts the two, stating
that the term business intelligence refers to collecting
business data to find information primarily through
asking questions, reporting, and online analytical
processes. Business analytics, on the other hand, uses
statistical and quantitative tools for explanatory and
predictive modeling.[13]

In an alternate definition, Thomas Davenport, professor
of information technology and management at Babson
College, argues that business intelligence should be
divided into querying, reporting, online analytical
processing (OLAP), an “alerts” tool, and business
analytics. In this definition, business analytics is the
subset of BI focusing on statistics, prediction, and
optimization, rather than the reporting functionality.[14]


7.6 Applications in an enterprise 

Business intelligence can be applied to the following busi¬ 
ness purposes, in order to drive business value. 

1. Measurement - program that creates a hierarchy 
of performance metrics (see also Metrics Refer¬ 
ence Model) and benchmarking that informs busi¬ 
ness leaders about progress towards business goals 
(business process management). 

2. Analytics - program that builds quantitative pro¬ 
cesses for a business to arrive at optimal deci¬ 
sions and to perform business knowledge discovery. 
Frequently involves: data mining, process mining, 
statistical analysis, predictive analytics, predictive 
modeling, business process modeling, data lineage, 
complex event processing and prescriptive analytics. 

3. Reporting/enterprise reporting - program that 
builds infrastructure for strategic reporting to serve 
the strategic management of a business, not opera¬ 
tional reporting. Frequently involves data visualiza¬ 
tion, executive information system and OLAP. 

4. Collaboration/collaboration platform - program that 
gets different areas (both inside and outside the busi¬ 
ness) to work together through data sharing and 
electronic data interchange. 

5. Knowledge management - program to make the
company data-driven through strategies and practices
to identify, create, represent, distribute, and
enable adoption of insights and experiences that are
true business knowledge. Knowledge management
leads to learning management and regulatory
compliance.

In addition to the above, business intelligence can provide 
a pro-active approach, such as alert functionality that im¬ 
mediately notifies the end-user if certain conditions are 
met. For example, if some business metric exceeds a 
pre-defined threshold, the metric will be highlighted in 
standard reports, and the business analyst may be alerted 
via e-mail or another monitoring service. This end-to-end
process requires data governance, which should be
handled by an expert.
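
The alert behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any particular BI product: the metric names, the threshold values and the `find_alerts` helper are all hypothetical.

```python
from typing import Dict, List


def find_alerts(metrics: Dict[str, float],
                thresholds: Dict[str, float]) -> List[str]:
    """Return the names of metrics whose value exceeds its threshold.

    Metrics that have no configured threshold are ignored.
    """
    return [
        name
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    ]


# Hypothetical daily figures and limits, for illustration only.
daily_metrics = {"returns_rate": 0.07, "avg_delivery_days": 2.5}
limits = {"returns_rate": 0.05, "avg_delivery_days": 3.0}

for name in find_alerts(daily_metrics, limits):
    # A real system might highlight the metric in a report or send
    # an e-mail; this sketch only prints a message.
    print(f"ALERT: {name} exceeded its pre-defined threshold")
```

Keeping the thresholds in a separate mapping, rather than hard-coding them in the check, is one way to let the limits be reviewed and changed as part of data governance.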

7.7 Prioritization of projects 

It can be difficult to provide a positive business case for 
business intelligence initiatives, and often the projects 
must be prioritized through strategic initiatives. BI 
projects can attain higher prioritization within the orga¬ 
nization if managers consider the following: 

• As described by Kimball,[15] the BI manager must
determine the tangible benefits, such as the eliminated
cost of producing legacy reports.

• Data access for the entire organization must be
enforced.[16] In this way even a small benefit, such
as a few minutes saved, makes a difference when
multiplied by the number of employees in the entire
organization.

• As described by Ross, Weill & Robertson for
Enterprise Architecture,[17] managers should also
consider letting the BI project be driven by other
business initiatives with excellent business cases. To
support this approach, the organization must have
enterprise architects who can identify suitable
business projects.

• Using a structured and quantitative methodology to
create defensible prioritization in line with the
actual needs of the organization, such as a weighted
decision matrix.[18]
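
A weighted decision matrix, as mentioned in the last point, can be illustrated with a short Python sketch. The criteria, weights, project names and scores below are invented for the example; in practice they would be agreed on by the organization.

```python
from typing import Dict, List


def weighted_score(scores: Dict[str, float],
                   weights: Dict[str, float]) -> float:
    """Sum of each criterion score multiplied by its weight."""
    return sum(weights[criterion] * scores[criterion]
               for criterion in weights)


def rank_projects(projects: Dict[str, Dict[str, float]],
                  weights: Dict[str, float]) -> List[str]:
    """Return project names ordered from highest to lowest score."""
    return sorted(projects,
                  key=lambda name: weighted_score(projects[name], weights),
                  reverse=True)


# Hypothetical criteria (weights sum to 1) and 1-5 scores.
weights = {"benefit": 0.5, "reach": 0.3, "feasibility": 0.2}
projects = {
    "sales dashboard": {"benefit": 4, "reach": 5, "feasibility": 3},
    "legacy report retirement": {"benefit": 3, "reach": 2, "feasibility": 5},
}

for name in rank_projects(projects, weights):
    print(name, round(weighted_score(projects[name], weights), 2))
```

Because the weights are explicit, the resulting ranking can be defended and re-run when priorities change.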

7.8 Success factors of implementation

According to Kimball et al., there are three critical areas
that organizations should assess before getting ready to do
a BI project:[19]


1. The level of commitment and sponsorship of the 
project from senior management 

2. The level of business need for creating a BI imple¬ 
mentation 

3. The amount and quality of business data available. 

7.8.1 Business sponsorship 

The commitment and sponsorship of senior management
is, according to Kimball et al., the most important
criterion for assessment.[20] This is because having strong
management backing helps overcome shortcomings elsewhere in
the project. However, as Kimball et al. state: “even the
most elegantly designed DW/BI system cannot overcome
a lack of business [management] sponsorship”.[21]

It is important that personnel who participate in the 
project have a vision and an idea of the benefits and draw¬ 
backs of implementing a BI system. The best business 
sponsor should have organizational clout and should be 
well connected within the organization. It is ideal that the 
business sponsor is demanding but also able to be realis¬ 
tic and supportive if the implementation runs into delays 
or drawbacks. The management sponsor also needs to 
be able to assume accountability and to take responsibil¬ 
ity for failures and setbacks on the project. Support from 
multiple members of the management ensures the project 
does not fail if one person leaves the steering group. How¬ 
ever, having many managers work together on the project 
can also mean that there are several different interests that 
attempt to pull the project in different directions, such as 
if different departments want to put more emphasis on 
their usage. This issue can be countered by an early and 
specific analysis of the business areas that benefit the most 
from the implementation. All stakeholders in the project 
should participate in this analysis in order for them to feel 
invested in the project and to find common ground. 

Another management problem that may be encountered 
before the start of an implementation is an overly aggres¬ 
sive business sponsor. Problems of scope creep occur 
when the sponsor requests data sets that were not spec¬ 
ified in the original planning phase. 

7.8.2 Business needs 

Because of the close relationship with senior management,
another critical thing that must be assessed before
the project begins is whether or not there is a business
need and whether there is a clear business benefit to
doing the implementation.[22] The needs and benefits of the
implementation are sometimes driven by competition and
the need to gain an advantage in the market. Another
reason for a business-driven approach to implementation of
BI is the acquisition of other organizations that enlarge
the original organization; in such cases it can be
beneficial to implement DW or BI in order to create more
oversight.

Companies that implement BI are often large, multinational
organizations with diverse subsidiaries.[23] A well-designed
BI solution provides a consolidated view of key
business data not available anywhere else in the
organization, giving management visibility and control over
measures that otherwise would not exist.

7.8.3 Amount and quality of available data 

Without proper data, or with too little quality data, any
BI implementation fails; it does not matter how good the
management sponsorship or business-driven motivation
is. Before implementation it is a good idea to do data
profiling. This analysis identifies the “content, consistency
and structure [...]”[22] of the data. This should be done as
early as possible in the process and, if the analysis shows
that data is lacking, the project should be put on hold
temporarily while the IT department figures out how to
properly collect data.
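
As a rough illustration of what a data profiling pass might report, the following Python sketch summarises a single column. It is a minimal example using only the standard library; the `profile_column` helper and the sample values are hypothetical, and real profiling tools report far more (value patterns, ranges, cross-column dependencies).

```python
from collections import Counter
from typing import Any, Dict, List, Optional


def profile_column(values: List[Optional[Any]]) -> Dict[str, Any]:
    """Summarise one column: size, missing values, distinct values."""
    # Treat None and the empty string as missing.
    present = [v for v in values if v is not None and v != ""]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }


# Hypothetical country codes taken from a customer table.
country_codes = ["DE", "FR", "", None, "DE"]
print(profile_column(country_codes))
```

A high share of missing values, or an unexpectedly large number of distinct values, is the kind of finding that would suggest putting the project on hold until data collection is fixed.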

When planning for business data and business intelligence 
requirements, it is always advisable to consider specific 
scenarios that apply to a particular organization, and then 
select the business intelligence features best suited for the 
scenario. 

Often, scenarios revolve around distinct business pro¬ 
cesses, each built on one or more data sources. These 
sources are used by features that present that data as in¬ 
formation to knowledge workers, who subsequently act 
on that information. The business needs of the organiza¬ 
tion for each business process adopted correspond to the 
essential steps of business intelligence. These essential 
steps of business intelligence include but are not limited 
to: 

1. Go through business data sources in order to collect 
needed data 

2. Convert business data to information and present ap¬ 
propriately 

3. Query and analyze data 

4. Act on the collected data 

The quality aspect in business intelligence should cover
the whole process, from the source data to the final reporting.
At each step, the quality gates are different:

1. Source Data: 

• Data Standardization: make data comparable 
(same unit, same pattern...) 

• Master Data Management: unique referential 

2. Operational Data Store (ODS): 

• Data Cleansing: detect & correct inaccurate 
data 


• Data Profiling: check inappropriate value, 
null/empty 

3. Data warehouse: 

• Completeness: check that all expected data are 
loaded 

• Referential integrity: unique and existing ref¬ 
erential over all sources 

• Consistency between sources: check consoli¬ 
dated data vs sources 

4. Reporting: 

• Uniqueness of indicators: only one shared
dictionary of indicators

• Formula accuracy: local reporting formula 
should be avoided or checked 
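
Two of the data warehouse gates listed above, completeness and referential integrity, can be sketched as simple set comparisons. The function names, keys and rows below are hypothetical and only illustrate the idea; a production check would normally run inside the database or the ETL tool.

```python
from typing import Dict, Iterable, List, Set


def missing_keys(source_keys: Iterable[str],
                 loaded_keys: Iterable[str]) -> Set[str]:
    """Completeness: keys present in the source but not loaded."""
    return set(source_keys) - set(loaded_keys)


def orphan_rows(fact_rows: List[Dict[str, str]],
                dimension_keys: Iterable[str],
                field: str) -> List[Dict[str, str]]:
    """Referential integrity: rows whose `field` has no matching key."""
    known = set(dimension_keys)
    return [row for row in fact_rows if row.get(field) not in known]


# Hypothetical order and customer identifiers.
source_orders = ["o1", "o2", "o3"]
warehouse_orders = ["o1", "o3"]
customers = ["c1", "c2"]
sales = [
    {"order": "o1", "customer": "c1"},
    {"order": "o3", "customer": "c9"},  # c9 is not a known customer
]

print(missing_keys(source_orders, warehouse_orders))
print(orphan_rows(sales, customers, "customer"))
```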


7.9 User aspect 

Some considerations must be made in order to successfully
integrate the usage of business intelligence systems
in a company. Ultimately the BI system must be accepted
and utilized by the users in order for it to add value to the
organization.[24][25] If the usability of the system is poor,
the users may become frustrated and spend a considerable
amount of time figuring out how to use the system,
or may not be able to use the system at all. If the system
does not add value to the users' mission, they simply will
not use it.[25]

To increase user acceptance of a BI system, it can be
advisable to consult business users at an early stage of the
DW/BI lifecycle, for example at the requirements gathering
phase.[24] This can provide an insight into the business
process and what the users need from the BI system.
There are several methods for gathering this information,
such as questionnaires and interview sessions.

When gathering the requirements from the business users,
the local IT department should also be consulted in order
to determine to what degree it is possible to fulfill the
business’s needs based on the available data.[24]

Taking a user-centered approach throughout the design 
and development stage may further increase the chance 
of rapid user adoption of the BI system.[25]

Besides focusing on the user experience offered by the
BI applications, it may also be possible to motivate users to
utilize the system by adding an element of competition.
Kimball[24] suggests implementing a function on the Business
Intelligence portal website where reports on system
usage can be found. By doing so, managers can see how
well their departments are doing and compare themselves
to others, and this may spur them to encourage their staff
to utilize the BI system even more.

In a 2007 article, H. J. Watson gives an example of how
the competitive element can act as an incentive.[26] Watson
describes how a large call centre implemented
performance dashboards for all call agents, with monthly
incentive bonuses tied to performance metrics. Also, agents
could compare their performance to other team members.
The implementation of this type of performance
measurement and competition significantly improved agent
performance.

The chances of BI success can be improved by involving
senior management to help make BI a part of the
organizational culture, and by providing the users with
necessary tools, training, and support.[26] Training
encourages more people to use the BI application.[24]

Providing user support is necessary to maintain the BI
system and resolve user problems.[25] User support can
be incorporated in many ways, for example by creating
a website. The website should contain great content and
tools for finding the necessary information. Furthermore,
helpdesk support can be used. The help desk can be
manned by power users or the DW/BI project team.[24]

7.10 BI Portals 

A Business Intelligence portal (BI portal) is the primary
access interface for Data Warehouse (DW) and Business
Intelligence (BI) applications. The BI portal is the user’s
first impression of the DW/BI system. It is typically a
browser application, from which the user has access to
all the individual services of the DW/BI system, reports
and other analytical functionality. The BI portal must be
implemented in such a way that it is easy for the users of
the DW/BI application to call on the functionality of the
application.[27]

The BI portal’s main functionality is to provide a naviga¬ 
tion system of the DW/BI application. This means that 
the portal has to be implemented in a way that the user 
has access to all the functions of the DW/BI application. 

The most common way to design the portal is to custom fit
it to the business processes of the organization for which
the DW/BI application is designed; in that way the portal
can best fit the needs and requirements of its users.[28]

The BI portal needs to be easy to use and understand, 
and if possible have a look and feel similar to other ap¬ 
plications or web content of the organization the DW/BI 
application is designed for (consistency). 

The following is a list of desirable features for web portals 
in general and BI portals in particular: 

Usable Users should easily find what they need in the BI
tool.

Content Rich The portal is not just a report printing 
tool, it should contain more functionality such as ad¬ 
vice, help, support information and documentation. 


Clean The portal should be designed so it is easily
understandable and not so complex as to confuse the
users.

Current The portal should be updated regularly. 

Interactive The portal should be implemented in a way 
that makes it easy for the user to use its functionality 
and encourage them to use the portal. Scalability 
and customization give the user the means to fit the 
portal to each user. 

Value Oriented It is important that the user has the feel¬ 
ing that the DW/BI application is a valuable resource 
that is worth working on. 


7.11 Marketplace 

There are a number of business intelligence vendors, often
categorized into the remaining independent “pure-play”
vendors and consolidated “megavendors” that have
entered the market through a recent trend[29] of acquisitions
in the BI industry.[30] The business intelligence market
is gradually growing. In 2012 business intelligence
services brought in $13.1 billion in revenue.[31]

Some companies adopting BI software decide to pick and 
choose from different product offerings (best-of-breed) 
rather than purchase one comprehensive integrated
solution (full-service).[32]

7.11.1 Industry-specific 

Specific considerations for business intelligence systems
have to be taken into account in some sectors, for example
because of governmental banking regulations. The
information collected by banking institutions and analyzed
with BI software must be
protected from some groups or individuals, while being 
fully available to other groups or individuals. Therefore, 
BI solutions must be sensitive to those needs and be flex¬ 
ible enough to adapt to new regulations and changes to 
existing law. 

7.12 Semi-structured or unstructured data

Businesses create a huge amount of valuable information
in the form of e-mails, memos, notes from call-centers,
news, user groups, chats, reports, web-pages,
presentations, image-files, video-files, and marketing
material. According to Merrill Lynch, more than
85% of all business information exists in these forms.
These information types are called either semi-structured
or unstructured data. However, organizations often only
use these documents once.[33]

The management of semi-structured data is recognized
as a major unsolved problem in the information technology
industry.[34] According to projections from Gartner
(2003), white collar workers spend anywhere from 30 to
40 percent of their time searching, finding and assessing
unstructured data. BI uses both structured and unstructured
data, but the former is easy to search, and the latter
contains a large quantity of the information needed for
analysis and decision making.[34][35] Because of the
difficulty of properly searching, finding and assessing
unstructured or semi-structured data, organizations may not
draw upon these vast reservoirs of information, which could
influence a particular decision, task or project. This can
ultimately lead to poorly informed decision making.[33]

Therefore, when designing a business intelligence/DW
solution, the specific problems associated with
semi-structured and unstructured data must be accommodated
as well as those for the structured data.[35]


7.12.1 Unstructured data vs. semi-structured data

Unstructured and semi-structured data have different 
meanings depending on their context. In the context of 
relational database systems, unstructured data cannot be 
stored in predictably ordered columns and rows. One type 
of unstructured data is typically stored in a BLOB (bi¬ 
nary large object), a catch-all data type available in most 
relational database management systems. Unstructured 
data may also refer to irregularly or randomly repeated 
column patterns that vary from row to row within each 
file or document. 

Many of these data types, however, like e-mails, word
processing text files, PPTs, image-files, and video-files
conform to a standard that offers the possibility of
metadata. Metadata can include information such as author
and time of creation, and this can be stored in a relational
database. Therefore, it may be more accurate to
talk about this as semi-structured documents or data,[34]
but no specific consensus seems to have been reached.

Unstructured data can also simply be the knowledge that 
business users have about future business trends. Busi¬ 
ness forecasting naturally aligns with the BI system be¬ 
cause business users think of their business in aggregate 
terms. Capturing the business knowledge that may only 
exist in the minds of business users provides some of the 
most important data points for a complete BI solution. 


7.12.2 Problems with semi-structured or unstructured data

There are several challenges to developing BI with
semi-structured data. According to Inmon & Nesavich,[36]
some of those are:


1. Physically accessing unstructured textual data - un¬ 
structured data is stored in a huge variety of formats. 

2. Terminology - Among researchers and analysts,
there is a need to develop a standardized terminology.

3. Volume of data - As stated earlier, up to 85% of all 
data exists as semi-structured data. Couple that with 
the need for word-to-word and semantic analysis. 

4. Searchability of unstructured textual data - A simple
search on some data, e.g. apple, results in links
where there is a reference to that precise search
term. Inmon & Nesavich (2008)[36] give an example:
“a search is made on the term felony. In a simple
search, the term felony is used, and everywhere
there is a reference to felony, a hit to an unstructured
document is made. But a simple search is crude.
It does not find references to crime, arson, murder,
embezzlement, vehicular homicide, and such, even
though these crimes are types of felonies.”
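
The limitation in the felony example can be made concrete with a small Python sketch that contrasts a literal search with one expanded by related terms. The `RELATED_TERMS` table is a hand-made, hypothetical stand-in for a real thesaurus or taxonomy, the documents are invented, and matching is by plain substring for brevity.

```python
from typing import Dict, List, Set

# Hypothetical mapping from a concept to more specific terms.
RELATED_TERMS: Dict[str, Set[str]] = {
    "felony": {"arson", "murder", "embezzlement"},
}


def simple_search(term: str, documents: List[str]) -> List[int]:
    """Indexes of documents containing the literal search term."""
    needle = term.lower()
    return [i for i, text in enumerate(documents)
            if needle in text.lower()]


def expanded_search(term: str, documents: List[str]) -> List[int]:
    """Indexes of documents containing the term or a related term."""
    needles = {term.lower()} | RELATED_TERMS.get(term.lower(), set())
    return [i for i, text in enumerate(documents)
            if any(needle in text.lower() for needle in needles)]


documents = [
    "The suspect was charged with a felony.",
    "The warehouse fire was ruled arson.",
    "Quarterly sales report for the northern region.",
]

print(simple_search("felony", documents))    # literal matches only
print(expanded_search("felony", documents))  # also finds the arson note
```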

7.12.3 The use of metadata 

To solve problems with searchability and assessment of
data, it is necessary to know something about the content.
This can be done by adding context through the use of
metadata.[33] Many systems already capture some metadata
(e.g. filename, author, size, etc.), but more useful
would be metadata about the actual content - e.g.
summaries, topics, people or companies mentioned. Two
technologies designed for generating metadata about
content are automatic categorization and information
extraction.


7.13 Future 

A 2009 paper predicted[37] these developments in the
business intelligence market:

• Because of lack of information, processes, and tools, 
through 2012, more than 35 percent of the top 5,000 
global companies regularly fail to make insightful 
decisions about significant changes in their business 
and markets. 

• By 2012, business units will control at least 40 per¬ 
cent of the total budget for business intelligence. 

• By 2012, one-third of analytic applications ap¬ 
plied to business processes will be delivered through 
coarse-grained application mashups. 

A 2009 Information Management special report predicted
the top BI trends: “green computing, social
networking services, data visualization, mobile BI,
predictive analytics, composite applications, cloud computing
and multitouch.”[38] Research undertaken in 2014
indicated that employees are more likely to have access
to, and more likely to engage with, cloud-based BI tools
than traditional tools.[39]

Other business intelligence trends include the following: 

• Third party SOA-BI products increasingly address 
ETL issues of volume and throughput. 

• Companies embrace in-memory processing, 64-bit 
processing, and pre-packaged analytic BI applica¬ 
tions. 

• Operational applications have callable BI compo¬ 
nents, with improvements in response time, scaling, 
and concurrency. 

• Near or real time BI analytics is a baseline expecta¬ 
tion. 

• Open source BI software replaces vendor offerings. 

Other lines of research include the combined study of 
business intelligence and uncertain data.[40][41] In this
context, the data used is not assumed to be precise,
accurate and complete. Instead, data is considered uncertain
and therefore this uncertainty is propagated to the results 
produced by BI. 

According to a study by the Aberdeen Group, there has 
been increasing interest in Software-as-a-Service (SaaS) 
business intelligence over the past years, with twice as 
many organizations using this deployment approach as 
a year earlier - 15% in 2009 compared to 7% in 2008.[42]

An article by InfoWorld’s Chris Kanaracus points out
similar growth data from research firm IDC, which predicts
the SaaS BI market will grow 22 percent each year
through 2013 thanks to increased product sophistication,
strained IT budgets, and other factors.[43]

An analysis of the top 100 Business Intelligence and
Analytics firms scores and ranks them based on several open
variables.[44]


7.14 See also 

• Accounting intelligence 

• Analytic applications 

• Artificial intelligence marketing 

• Business Intelligence 2.0 

• Business process discovery 

• Business process management 

• Business activity monitoring 


• Business service management 

• Customer dynamics 

• Data Presentation Architecture 

• Data visualization 

• Decision engineering 

• Enterprise planning systems 

• Document intelligence 

• Integrated business planning 

• Location intelligence 

• Media intelligence 

• Meteorological intelligence 

• Mobile business intelligence 

• Multiway Data Analysis 

• Operational intelligence 

• Business Information Systems 

• Business intelligence tools 

• Process mining 

• Real-time business intelligence 

• Runtime intelligence 

• Sales intelligence 

• Spend management 

• Test and learn 

Chapter 8

Analytics 


For the ice hockey term, see Analytics (ice hockey). 

Analytics is the discovery and communication of mean¬ 
ingful patterns in data. Especially valuable in areas rich 
with recorded information, analytics relies on the simul¬ 
taneous application of statistics, computer programming 
and operations research to quantify performance. Ana¬ 
lytics often favors data visualization to communicate in¬ 
sight. 

Firms may commonly apply analytics to business data, 
to describe, predict, and improve business performance. 
Specifically, areas within analytics include predictive an¬ 
alytics, enterprise decision management, retail analyt¬ 
ics, store assortment and stock-keeping unit optimiza¬ 
tion, marketing optimization and marketing mix mod¬ 
eling, web analytics, sales force sizing and optimiza¬ 
tion, price and promotion modeling, predictive science, 
credit risk analysis, and fraud analytics. Since analytics
can require extensive computation (see big data), the
algorithms and software used for analytics harness the
most current methods in computer science, statistics, and
mathematics.[1]


8.1 Analytics vs. analysis 

Analytics is a multidimensional discipline. There is ex¬ 
tensive use of mathematics and statistics, the use of de¬ 
scriptive techniques and predictive models to gain valu¬ 
able knowledge from data—data analysis. The insights 
from data are used to recommend action or to guide de¬ 
cision making rooted in business context. Thus, analyt¬ 
ics is not so much concerned with individual analyses or 
analysis steps, but with the entire methodology. There is a
pronounced tendency to use the term analytics in business
settings, e.g. text analytics vs. the more generic text
mining, to emphasize this broader perspective. There is an
increasing use of the term advanced analytics, typically
used to describe the technical aspects of analytics,
especially in emerging fields such as the use of machine
learning techniques like neural networks to do predictive
modeling.


8.2 Examples 


8.2.1 Marketing optimization 


Marketing has evolved from a creative process into a 
highly data-driven process. Marketing organizations use 
analytics to determine the outcomes of campaigns or ef¬ 
forts and to guide decisions for investment and consumer 
targeting. Demographic studies, customer segmentation, 
conjoint analysis and other techniques allow marketers 
to use large amounts of consumer purchase, survey and 
panel data to understand and communicate marketing 
strategy. 

Web analytics allows marketers to collect session-level
information about interactions on a website using an
operation called sessionization. Google Analytics is an
example of a popular free analytics tool that marketers use
for this purpose. Those interactions provide the web
analytics information systems with the information to track
the referrer, search keywords, IP address, and activities
of the visitor. With this information, a marketer can
improve the marketing campaigns, site creative content, and
information architecture.
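
Sessionization can be illustrated with a minimal Python sketch that groups each visitor's page views into sessions. The 30-minute inactivity timeout is an assumption chosen for the example, not something stated in the text, and the `sessionize` helper and visitor identifiers are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Hit = Tuple[str, datetime]  # (visitor id, time of the page view)


def sessionize(hits: List[Hit],
               timeout: timedelta = timedelta(minutes=30)
               ) -> Dict[str, List[List[datetime]]]:
    """Group each visitor's hits into sessions.

    A new session starts when the gap since the visitor's previous
    hit is longer than `timeout` (an assumed inactivity limit).
    """
    sessions: Dict[str, List[List[datetime]]] = {}
    for visitor, ts in sorted(hits, key=lambda h: (h[0], h[1])):
        visitor_sessions = sessions.setdefault(visitor, [])
        if visitor_sessions and ts - visitor_sessions[-1][-1] <= timeout:
            visitor_sessions[-1].append(ts)   # continue current session
        else:
            visitor_sessions.append([ts])     # start a new session
    return sessions


# Hypothetical page views for two visitors.
hits = [
    ("a", datetime(2015, 1, 1, 9, 0)),
    ("b", datetime(2015, 1, 1, 9, 5)),
    ("a", datetime(2015, 1, 1, 9, 10)),
    ("a", datetime(2015, 1, 1, 10, 0)),  # 50 minutes later: new session
]

for visitor, visitor_sessions in sessionize(hits).items():
    print(visitor, [len(s) for s in visitor_sessions])
```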

Analysis techniques frequently used in marketing include
marketing mix modeling, pricing and promotion analyses,
sales force optimization, and customer analytics, e.g.
segmentation. Web analytics and optimization of web
sites and online campaigns now frequently work hand in
hand with the more traditional marketing analysis
techniques. A focus on digital media has slightly changed
the vocabulary so that marketing mix modeling is commonly
referred to as attribution modeling in the digital or
marketing mix modeling context.

These tools and techniques support both strategic mar¬ 
keting decisions (such as how much overall to spend on 
marketing and how to allocate budgets across a portfo¬ 
lio of brands and the marketing mix) and more tactical 
campaign support in terms of targeting the best poten¬ 
tial customer with the optimal message in the most cost 
effective medium at the ideal time. 

8.2.2 Portfolio analysis

A common application of business analytics is portfolio 
analysis. In this, a bank or lending agency has a collec¬ 
tion of accounts of varying value and risk. The accounts 
may differ by the social status (wealthy, middle-class, 
poor, etc.) of the holder, the geographical location, its 
net value, and many other factors. The lender must bal¬ 
ance the return on the loan with the risk of default for each 
loan. The question is then how to evaluate the portfolio 
as a whole. 

The least risky loans may be those to the very wealthy, but there
are a very limited number of wealthy people. On the
other hand, there are many poor people who can be lent to, but
at greater risk. Some balance must be struck that maximizes
return and minimizes risk. The analytics solution
may combine time series analysis with many other issues
in order to make decisions on when to lend money to these
different borrower segments, or decisions on the interest
rate charged to members of a portfolio segment to cover
any losses among members in that segment.
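
The idea of setting a segment's interest rate so that it covers expected losses can be made concrete with a simple one-period calculation. The sketch below is only illustrative: the default probabilities, the recovery fraction and the target return are hypothetical numbers, and the model (a single repayment, with only a fraction of principal recovered on default) is an assumption rather than a description of how any lender prices loans.

```python
def break_even_rate(default_prob, recovery=0.0, target_return=0.0):
    """Interest rate r at which the expected payoff per unit lent equals
    1 + target_return, under a one-period model:

        (1 - p) * (1 + r) + p * recovery = 1 + target_return
    """
    if not 0.0 <= default_prob < 1.0:
        raise ValueError("default_prob must be in [0, 1)")
    return (1.0 + target_return - default_prob * recovery) / (1.0 - default_prob) - 1.0


# Hypothetical default probabilities per borrower segment.
segments = {"wealthy": 0.01, "middle-class": 0.04, "poor": 0.10}

rates = {
    name: break_even_rate(p, recovery=0.4, target_return=0.03)
    for name, p in segments.items()
}

for name, rate in rates.items():
    print(f"{name:>12}: {rate:.2%}")
```

Under these assumed inputs the riskier segments require a higher rate to deliver the same expected return, which is the balance between return and risk described above.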

8.2.3 Risk analytics 

Predictive models in the banking industry are developed
to bring certainty across the risk scores for individual
customers. Credit scores are built to predict an individual’s
delinquency behaviour and are widely used to evaluate the
creditworthiness of each applicant. Furthermore, risk
analyses are carried out in the scientific world and the
insurance industry.

8.2.4 Digital analytics 

Digital analytics is a set of business and technical
activities that define, create, collect, verify or transform
digital data into reporting, research, analyses,
recommendations, optimizations, predictions, and automations.[2]

8.2.5 Security analytics 

Security analytics refers to information technology (IT) 
solutions that gather and analyze security events to bring 
situational awareness and enable IT staff to understand 
and analyze events that pose the greatest risk.[3] Solutions
in this area include Security information and event man¬ 
agement solutions and user behavior analytics solutions. 

8.2.6 Software analytics 

Main article: Software analytics 

Software analytics is the process of collecting information 
about the way a piece of software is used and produced. 


8.3 Challenges 

In the industry of commercial analytics software, an em¬ 
phasis has emerged on solving the challenges of analyzing 
massive, complex data sets, often when such data is in a 
constant state of change. Such data sets are commonly re¬ 
ferred to as big data. Whereas once the problems posed 
by big data were only found in the scientific community, 
today big data is a problem for many businesses that op¬ 
erate transactional systems online and, as a result, amass 
large volumes of data quickly.[4]

The analysis of unstructured data types is another
challenge getting attention in the industry. Unstructured
data differs from structured data in that its format
varies widely and cannot be stored in traditional
relational databases without significant effort at data
transformation.[5] Sources of unstructured data, such as
email, the contents of word processor documents, PDFs,
geospatial data, etc., are rapidly becoming a relevant
source of business intelligence for businesses, governments
and universities.[6] For example, the discovery in Britain
that one company was illegally selling fraudulent
doctor’s notes in order to assist people in defrauding
employers and insurance companies[7] is an opportunity for
insurance firms to increase the vigilance of their
unstructured data analysis. The McKinsey Global Institute
estimates that big data analysis could save the American
health care system $300 billion per year and the European
public sector €250 billion.[8]

These challenges are the current inspiration for much of 
the innovation in modern analytics information systems, 
giving birth to relatively new machine analysis concepts 
such as complex event processing, full text search and 
analysis, and even new ideas in presentation.[9] One such
innovation is the introduction of grid-like architecture in
machine analysis, allowing increases in the speed of
massively parallel processing by distributing the workload to
many computers all with equal access to the complete
data set.[10]

Analytics is increasingly used in education, particularly 
at the district and government office levels. However, 
the complexity of student performance measures presents 
challenges when educators try to understand and use an¬ 
alytics to discern patterns in student performance, pre¬ 
dict graduation likelihood, improve chances of student 
success, etc. For example, in a study involving districts 
known for strong data use, 48% of teachers had difficulty 
posing questions prompted by data, 36% did not comprehend
given data, and 52% incorrectly interpreted data.[11]
To combat this, some analytics tools for educators adhere
to an over-the-counter data format (embedding labels,
supplemental documentation, and a help system, and
making key package/display and content decisions) to
improve educators’ understanding and use of the analytics
being displayed.[12]

One more emerging challenge is dynamic regulatory 
needs. For example, in the banking industry, Basel III
and future capital adequacy needs are likely to make even
smaller banks adopt internal risk models. In such cases,
cloud computing and the open source R programming
language can help smaller banks to adopt risk analytics and
support branch-level monitoring by applying predictive
analytics.


8.4 Risks 

The main risk for individuals is discrimination, such as price
discrimination or statistical discrimination.

There is also the risk that a developer could profit from
the ideas or work done by users. For example, users
could write new ideas in a note-taking app, which could
then be sent as a custom event, and the developers could
profit from those ideas. This can happen because the
ownership of content is usually unclear in the law.[13]

If a user’s identity is not protected, there are more risks; 
for example, the risk that private information about users 
is made public on the internet. 

In the extreme, there is the risk that governments could 
gather too much private information, now that the gov¬ 
ernments are giving themselves more powers to access 
citizens’ information. 

Further information: Telecommunications data retention 

Data mining


Not to be confused with analytics, information extrac¬ 
tion, or data analysis. 

Data mining (the analysis step of the “Knowledge Discovery
in Databases” process, or KDD),[1] an interdisciplinary
subfield of computer science,[2][3][4] is the computational
process of discovering patterns in large data
sets involving methods at the intersection of artificial
intelligence, machine learning, statistics, and database
systems.[2] The overall goal of the data mining process is
to extract information from a data set and transform it
into an understandable structure for further use.[2] Aside
from the raw analysis step, it involves database and
data management aspects, data pre-processing, model
and inference considerations, interestingness metrics,
complexity considerations, post-processing of discovered
structures, visualization, and online updating.[2]

The term is a misnomer, because the goal is the extraction
of patterns and knowledge from large amounts
of data, not the extraction of data itself.[5] It is also
a buzzword[6] and is frequently applied to any form of
large-scale data or information processing (collection,
extraction, warehousing, analysis, and statistics) as well
as any application of computer decision support systems,
including artificial intelligence, machine learning,
and business intelligence. The popular book “Data mining:
Practical machine learning tools and techniques with
Java”[7] (which covers mostly machine learning material)
was originally to be named just “Practical machine learning”,
and the term “data mining” was only added for marketing
reasons.[8] Often the more general terms “(large
scale) data analysis”, or “analytics” - or when referring to
actual methods, artificial intelligence and machine learning
- are more appropriate.

The actual data mining task is the automatic or semi¬ 
automatic analysis of large quantities of data to ex¬ 
tract previously unknown, interesting patterns such as 
groups of data records (cluster analysis), unusual records 
(anomaly detection), and dependencies (association rule 
mining). This usually involves using database techniques 
such as spatial indices. These patterns can then be seen 
as a kind of summary of the input data, and may be used 
in further analysis or, for example, in machine learning 
and predictive analytics. For example, the data mining
step might identify multiple groups in the data, which can
then be used to obtain more accurate prediction results 
by a decision support system. Neither the data collection, 
data preparation, nor result interpretation and reporting 
are part of the data mining step, but do belong to the over¬ 
all KDD process as additional steps. 

The related terms data dredging, data fishing, and data
snooping refer to the use of data mining methods to sample
parts of a larger population data set that are (or may
be) too small for reliable statistical inferences to be made
about the validity of any patterns discovered. These
methods can, however, be used in creating new hypotheses
to test against the larger data populations.


9.1 Etymology 

In the 1960s, statisticians used terms like “Data Fishing”
or “Data Dredging” to refer to what they considered
the bad practice of analyzing data without an a priori
hypothesis. The term “Data Mining” appeared around
1990 in the database community. For a short time in the
1980s, the phrase “database mining”™ was used, but since
it was trademarked by HNC, a San Diego-based company,
to pitch their Database Mining Workstation,[9]
researchers consequently turned to “data mining”. Other
terms used include Data Archaeology, Information Harvesting,
Information Discovery, Knowledge Extraction,
etc. Gregory Piatetsky-Shapiro coined the term “Knowledge
Discovery in Databases” for the first workshop on
the same topic (KDD-1989) and this term became more
popular in the AI and machine learning community. However,
the term data mining became more popular in the
business and press communities.[10] Currently, Data Mining
and Knowledge Discovery are used interchangeably.
Since about 2007 the term “Predictive Analytics”, and since
2011 the term “Data Science”, have also been used to describe
this field.


9.2 Background 

The manual extraction of patterns from data has occurred 
for centuries. Early methods of identifying patterns in 
data include Bayes’ theorem (1700s) and regression analysis
(1800s). The proliferation, ubiquity and increasing
power of computer technology have dramatically increased
data collection, storage, and manipulation ability.
As data sets have grown in size and complexity,
direct “hands-on” data analysis has increasingly been
augmented with indirect, automated data processing, aided
by other discoveries in computer science, such as neural
networks, cluster analysis, genetic algorithms (1950s),
decision trees and decision rules (1960s), and support
vector machines (1990s). Data mining is the process
of applying these methods with the intention of uncovering
hidden patterns[11] in large data sets. It bridges
the gap from applied statistics and artificial intelligence
(which usually provide the mathematical background) to
database management by exploiting the way data is stored
and indexed in databases to execute the actual learning
and discovery algorithms more efficiently, allowing such
methods to be applied to ever larger data sets.

9.2.1 Research and evolution 

The premier professional body in the field is the
Association for Computing Machinery's (ACM) Special
Interest Group (SIG) on Knowledge Discovery and Data
Mining (SIGKDD).[12][13] Since 1989 this ACM SIG has
hosted an annual international conference and published
its proceedings,[14] and since 1999 it has published a
biannual academic journal titled “SIGKDD Explorations”.[15]

Computer science conferences on data mining include: 

• CIKM Conference - ACM Conference on Informa¬ 
tion and Knowledge Management 

• DMIN Conference - International Conference on 
Data Mining 

• DMKD Conference - Research Issues on Data Min¬ 
ing and Knowledge Discovery 

• ECDM Conference - European Conference on Data 
Mining 

• ECML-PKDD Conference - European Conference 
on Machine Learning and Principles and Practice of 
Knowledge Discovery in Databases 

• EDM Conference - International Conference on 
Educational Data Mining 

• ICDM Conference - IEEE International Conference 
on Data Mining 

• KDD Conference - ACM SIGKDD Conference on 
Knowledge Discovery and Data Mining 

• MLDM Conference - Machine Learning and Data 
Mining in Pattern Recognition 


• PAKDD Conference - The annual Pacific-Asia 
Conference on Knowledge Discovery and Data Min¬ 
ing 

• PAW Conference - Predictive Analytics World 

• SDM Conference - SIAM International Conference 
on Data Mining (SIAM) 

• SSTD Symposium - Symposium on Spatial and 
Temporal Databases 

• WSDM Conference - ACM Conference on Web 
Search and Data Mining 

Data mining topics are also present at many data
management/database conferences such as the ICDE
Conference, SIGMOD Conference and International
Conference on Very Large Data Bases.

9.3 Process 

The Knowledge Discovery in Databases (KDD) pro¬ 
cess is commonly defined with the stages: 

(1) Selection 

(2) Pre-processing 

(3) Transformation 

(4) Data Mining 

(5) Interpretation/Evaluation.[1]

It exists, however, in many variations on this theme, such 
as the Cross Industry Standard Process for Data Mining 
(CRISP-DM) which defines six phases: 

(1) Business Understanding 

(2) Data Understanding 

(3) Data Preparation 

(4) Modeling 

(5) Evaluation 

(6) Deployment 

or a simplified process such as (1) pre-processing, (2) data 
mining, and (3) results validation. 

Polls conducted in 2002, 2004, and 2007 show that
the CRISP-DM methodology is the leading methodology
used by data miners.[16][17][18] The only other data mining
standard named in these polls was SEMMA. However, 3-4
times as many people reported using CRISP-DM. Several
teams of researchers have published reviews of data
mining process models,[19][20] and Azevedo and Santos
conducted a comparison of CRISP-DM and SEMMA in
2008.[21]

9.3.1 Pre-processing

Before data mining algorithms can be used, a target data 
set must be assembled. As data mining can only uncover 
patterns actually present in the data, the target data set 
must be large enough to contain these patterns while re¬ 
maining concise enough to be mined within an acceptable 
time limit. A common source for data is a data mart or 
data warehouse. Pre-processing is essential to analyze the 
multivariate data sets before data mining. The target set 
is then cleaned. Data cleaning removes the observations 
containing noise and those with missing data. 
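
A minimal sketch of this cleaning step is shown below. It drops records with missing values and then drops gross outliers as a stand-in for "noise". The outlier rule (more than a chosen number of median absolute deviations from the median), the field names and the sample records are all assumptions made for illustration; real projects choose cleaning rules to suit their data.

```python
from statistics import median


def clean(records, fields, max_deviations=5.0):
    """Drop records with missing values, then drop gross outliers.

    A value is treated as noise when it lies more than `max_deviations`
    median absolute deviations (MAD) from the median of its field.
    """
    complete = [r for r in records if all(r.get(f) is not None for f in fields)]
    kept = complete
    for f in fields:
        values = [r[f] for r in complete]
        med = median(values)
        mad = median(abs(v - med) for v in values)
        if mad == 0:
            continue  # no spread in this field, so nothing to flag
        kept = [r for r in kept if abs(r[f] - med) <= max_deviations * mad]
    return kept


# Hypothetical customer records.
records = [
    {"age": 34, "income": 52_000},
    {"age": None, "income": 48_000},   # missing value
    {"age": 41, "income": 61_000},
    {"age": 29, "income": 45_000},
    {"age": 38, "income": 9_900_000},  # implausible income
]
cleaned = clean(records, fields=["age", "income"])
```

With these inputs the record with the missing age and the record with the implausible income are both removed, leaving three records for mining.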

9.3.2 Data mining 

Data mining involves six common classes of tasks:[1]

• Anomaly detection (Outlier/change/deviation detection)
- The identification of unusual data records
that might be interesting, or of data errors that require
further investigation.

• Association rule learning (Dependency modelling) 
- Searches for relationships between variables. For
example, a supermarket might gather data on customer
purchasing habits. Using association rule
learning, the supermarket can determine which 
products are frequently bought together and use this 
information for marketing purposes. This is some¬ 
times referred to as market basket analysis. 

• Clustering - is the task of discovering groups and 
structures in the data that are in some way or an¬ 
other “similar”, without using known structures in 
the data. 

• Classification - is the task of generalizing known 
structure to apply to new data. For example, an e- 
mail program might attempt to classify an e-mail as 
“legitimate” or as “spam”. 

• Regression - attempts to find a function which mod¬ 
els the data with the least error. 

• Summarization - providing a more compact repre¬ 
sentation of the data set, including visualization and 
report generation. 
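
To make the association rule learning task in the list above more concrete, the sketch below counts how often pairs of products occur together and reports rules "A → B" that meet minimum support and confidence thresholds. The baskets, thresholds and function name are invented for this example, and only pairs of items are considered; practical algorithms such as Apriori handle larger item sets.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence) for item pairs.

    support    = share of baskets containing both items
    confidence = share of baskets with lhs that also contain rhs
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules


# Hypothetical supermarket baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

rules = pair_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data, every basket containing jam also contains bread, so "jam → bread" has confidence 1.0, whereas "bread → jam" does not pass the confidence threshold.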

9.3.3 Results validation 

Data mining can unintentionally be misused, and can then
produce results which appear to be significant but which
do not actually predict future behavior, cannot be
reproduced on a new sample of data, and are of little use.
Often this results from investigating too many hypotheses
and not performing proper statistical hypothesis testing.
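
The effect of investigating too many hypotheses can be quantified with a short calculation. The sketch below assumes that the tests are independent and that no real effect exists in any of them; under those assumptions it computes the chance of at least one spuriously "significant" result, together with the Bonferroni-corrected threshold, which is one common (and conservative) remedy. The function names are chosen for this example.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Probability of at least one false positive among `num_tests`
    independent tests, each run at significance level `alpha`."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test significance level that keeps the overall false-positive
    probability at or below `alpha`."""
    return alpha / num_tests


for m in (1, 20, 100):
    print(
        f"{m:>3} tests: P(at least one false positive) = "
        f"{familywise_error_rate(m):.3f}, "
        f"Bonferroni threshold = {bonferroni_threshold(m):.4f}"
    )
```

With 20 hypotheses tested at the 5% level, the chance of at least one false positive is already about 64%, which is why uncorrected results can look significant and still fail to reproduce.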


A simple version of this problem in machine learning is 
known as overfitting, but the same problem can arise at 
different phases of the process and thus a train/test split 
- when applicable at all - may not be sufficient to prevent 
this from happening. 

The final step of knowledge discovery from data is to ver¬ 
ify that the patterns produced by the data mining algo¬ 
rithms occur in the wider data set. Not all patterns found 
by the data mining algorithms are necessarily valid. It is 
common for the data mining algorithms to find patterns 
in the training set which are not present in the general 
data set. This is called overfitting. To overcome this, the 
evaluation uses a test set of data on which the data min¬ 
ing algorithm was not trained. The learned patterns are 
applied to this test set, and the resulting output is com¬ 
pared to the desired output. For example, a data mining 
algorithm trying to distinguish “spam” from “legitimate” 
emails would be trained on a training set of sample e- 
mails. Once trained, the learned patterns would be ap¬ 
plied to the test set of e-mails on which it had not been 
trained. The accuracy of the patterns can then be mea¬ 
sured from how many e-mails they correctly classify. A 
number of statistical methods may be used to evaluate the 
algorithm, such as ROC curves. 
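
The spam example can be sketched in a few lines of Python. The "learning" rule here is deliberately naive and is an assumption of this example: it memorizes words that occur only in spam training messages. The messages themselves are invented. The point is the evaluation procedure: patterns are learned on the training set only and then scored on held-out test messages.

```python
def train(messages):
    """Learn the set of words seen in spam but never in legitimate mail."""
    spam_words, ham_words = set(), set()
    for text, is_spam in messages:
        (spam_words if is_spam else ham_words).update(text.lower().split())
    return spam_words - ham_words


def predict(model, text):
    """Flag a message as spam if it contains any learned spam word."""
    return any(word in model for word in text.lower().split())


def evaluate(model, messages):
    """Accuracy, true positive rate and false positive rate on `messages`."""
    tp = fp = tn = fn = 0
    for text, is_spam in messages:
        guess = predict(model, text)
        if guess and is_spam:
            tp += 1
        elif guess and not is_spam:
            fp += 1
        elif not guess and is_spam:
            fn += 1
        else:
            tn += 1
    return {
        "accuracy": (tp + tn) / len(messages),
        "tpr": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }


training_set = [
    ("win a free prize now", True),
    ("cheap pills free shipping", True),
    ("meeting agenda for monday", False),
    ("lunch now or later", False),
]
test_set = [
    ("free prize inside", True),
    ("cheap flights to monday meeting", False),
    ("agenda for lunch", False),
    ("claim your reward", True),
]

model = train(training_set)
train_scores = evaluate(model, training_set)
test_scores = evaluate(model, test_set)
print("training:", train_scores)
print("test:    ", test_scores)
```

The memorized patterns classify every training message correctly but only half of the test messages, a small-scale picture of overfitting. The true positive and false positive rates give a single point on a ROC curve; a classifier that outputs scores would trace out a full curve as its decision threshold is varied.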

If the learned patterns do not meet the desired standards,
then it is necessary to re-evaluate and change the
pre-processing and data mining steps. If the learned
patterns do meet the desired standards, then the final step is
to interpret the learned patterns and turn them into
knowledge.


9.4 Standards 


There have been some efforts to define standards for 
the data mining process, for example the 1999 Euro¬ 
pean Cross Industry Standard Process for Data Mining 
(CRISP-DM 1.0) and the 2004 Java Data Mining stan¬ 
dard (JDM 1.0). Development on successors to these pro¬ 
cesses (CRISP-DM 2.0 and JDM 2.0) was active in 2006, 
but has stalled since. JDM 2.0 was withdrawn without 
reaching a final draft. 

For exchanging the extracted models - in particular for 
use in predictive analytics - the key standard is the 
Predictive Model Markup Language (PMML), which is 
an XML-based language developed by the Data Min¬ 
ing Group (DMG) and supported as exchange format by 
many data mining applications. As the name suggests, it 
only covers prediction models, a particular data mining 
task of high importance to business applications. However,
extensions to cover (for example) subspace clustering
have been proposed independently of the DMG.[22]

9.5 Notable uses



9.5.1 Games 

Since the early 1960s, with the availability of oracles 
for certain combinatorial games, also called tablebases 
(e.g. for 3x3-chess) with any beginning configuration, 
small-board dots-and-boxes, small-board-hex, and certain
endgames in chess, dots-and-boxes, and hex, a new
area for data mining has been opened. This is the
extraction of human-usable strategies from these oracles.
Current pattern recognition approaches do not seem to 
fully acquire the high level of abstraction required to be 
applied successfully. Instead, extensive experimentation 
with the tablebases - combined with an intensive study 
of tablebase-answers to well designed problems, and with 
knowledge of prior art (i.e., pre-tablebase knowledge) - 
is used to yield insightful patterns. Berlekamp (in dots- 
and-boxes, etc.) and John Nunn (in chess endgames) are 
notable examples of researchers doing this work, though 
they were not - and are not - involved in tablebase gen¬ 
eration. 

9.5.2 Business 

In business, data mining is the analysis of historical busi¬ 
ness activities, stored as static data in data warehouse 
databases. The goal is to reveal hidden patterns and 
trends. Data mining software uses advanced pattern 
recognition algorithms to sift through large amounts of 
data to assist in discovering previously unknown strategic
business information. Examples of what businesses
use data mining for include performing market analysis
to identify new product bundles, finding the root cause
of manufacturing problems, preventing customer attrition
and acquiring new customers, cross-selling to existing
customers, and profiling customers with more accuracy.[23]

• In today’s world raw data is being collected by com¬ 
panies at an exploding rate. For example, Walmart 
processes over 20 million point-of-sale transactions 
every day. This information is stored in a centralized 
database, but would be useless without some type of 
data mining software to analyze it. If Walmart ana¬ 
lyzed their point-of-sale data with data mining tech¬ 
niques they would be able to determine sales trends, 
develop marketing campaigns, and more accurately 
predict customer loyalty.[24][25]

• Every time a credit card or a store loyalty card is
used, or a warranty card is filled out, data
is collected about the user’s behavior. Many
people find the amount of information stored about
us by companies such as Google, Facebook, and
Amazon disturbing and are concerned about privacy.
Although there is the potential for our personal
data to be used in harmful or unwanted ways,
it is also being used to make our lives better. For
example, Ford and Audi hope to one day collect
information about customer driving patterns so they
can recommend safer routes and warn drivers about
dangerous road conditions.[26]

• Data mining in customer relationship management 
applications can contribute significantly to the bot¬ 
tom line. Rather than randomly contacting a 
prospect or customer through a call center or send¬ 
ing mail, a company can concentrate its efforts on 
prospects that are predicted to have a high likeli¬ 
hood of responding to an offer. More sophisticated 
methods may be used to optimize resources across 
campaigns so that one may predict to which channel 
and to which offer an individual is most likely to re¬ 
spond (across all potential offers). Additionally, so¬ 
phisticated applications could be used to automate 
mailing. Once the results from data mining (po¬ 
tential prospect/customer and channel/offer) are de¬ 
termined, this “sophisticated application” can either 
automatically send an e-mail or a regular mail. Fi¬ 
nally, in cases where many people will take an action 
without an offer, "uplift modeling" can be used to 
determine which people have the greatest increase in 
response if given an offer. Uplift modeling thereby 
enables marketers to focus mailings and offers on 
persuadable people, and not to send offers to peo¬ 
ple who will buy the product without an offer. Data 
clustering can also be used to automatically discover 
the segments or groups within a customer data set. 

• Businesses employing data mining may see a return 
on investment, but also they recognize that the num¬ 
ber of predictive models can quickly become very 
large. For example, rather than using one model to 
predict how many customers will churn, a business 
may choose to build a separate model for each region 
and customer type. In situations where a large num¬ 
ber of models need to be maintained, some busi¬ 
nesses turn to more automated data mining method¬ 
ologies. 

• Data mining can be helpful to human resources 
(HR) departments in identifying the characteristics 
of their most successful employees. Information ob¬ 
tained - such as universities attended by highly suc¬ 
cessful employees - can help HR focus recruiting ef¬ 
forts accordingly. Additionally, Strategic Enterprise 
Management applications help a company trans¬ 
late corporate-level goals, such as profit and margin 
share targets, into operational decisions, such as production
plans and workforce levels.[27]

• Market basket analysis relates to data-mining use
in retail sales. If a clothing store records the purchases
of customers, a data mining system could
identify those customers who favor silk shirts over
cotton ones. Although some relationships may be
difficult to explain, taking advantage of them
is easier. The example deals with association rules
within transaction-based data. Not all data are
transaction-based, and logical or inexact rules may also be
present within a database.

• Market basket analysis has been used to identify the 
purchase patterns of the Alpha Consumer. Analyz¬ 
ing the data collected on this type of user has allowed 
companies to predict future buying trends and fore¬ 
cast supply demands. 

• Data mining is a highly effective tool in the catalog 
marketing industry. Catalogers have a rich database 
of history of their customer transactions for millions 
of customers dating back a number of years. Data 
mining tools can identify patterns among customers 
and help identify the most likely customers to re¬ 
spond to upcoming mailing campaigns. 

• Data mining for business applications can be integrated
into a complex modeling and decision making
process.[28] Reactive business intelligence (RBI)
advocates a “holistic” approach that integrates data
mining, modeling, and interactive visualization into
an end-to-end discovery and continuous innovation
process powered by human and automated
learning.[29]

• In the area of decision making, the RBI approach
has been used to mine knowledge that is progressively
acquired from the decision maker, and then
self-tune the decision method accordingly.[30] The
relation between the quality of a data mining system
and the amount of investment that the decision
maker is willing to make was formalized by
providing an economic perspective on the value
of “extracted knowledge” in terms of its payoff to
the organization.[28] This decision-theoretic classification
framework[28] was applied to a real-world
semiconductor wafer manufacturing line, where
decision rules for effectively monitoring and controlling
the semiconductor wafer fabrication line
were developed.[31]

• An example of data mining related to an
integrated-circuit (IC) production line is described in the
paper “Mining IC Test Data to Optimize VLSI
Testing.”[32] In this paper, the application of data
mining and decision analysis to the problem of
die-level functional testing is described. Experiments
mentioned demonstrate the ability to apply a system
of mining historical die-test data to create a probabilistic
model of patterns of die failure. These patterns
are then utilized to decide, in real time, which
die to test next and when to stop testing. This system
has been shown, based on experiments with historical
test data, to have the potential to improve profits
on mature IC products. Other examples[33][34] of the
application of data mining methodologies in semiconductor
manufacturing environments suggest that
data mining methodologies may be particularly useful
when data is scarce, and the various physical and
chemical parameters that affect the process exhibit
highly complex interactions. Another implication is
that on-line monitoring of the semiconductor manufacturing
process using data mining may be highly
effective.
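
The uplift modeling idea mentioned in the customer relationship management item of the list above can be illustrated with a very small calculation. The sketch assumes that, within each customer segment, some customers received an offer and a comparable control group did not; uplift is then estimated as the difference between the two response rates. The segment names and counts are invented, and practical uplift models are considerably more elaborate than this per-segment difference.

```python
def uplift_by_segment(records):
    """Estimate uplift per segment from (segment, received_offer, responded).

    uplift = response rate with the offer - response rate without it
    """
    counts = {}
    for segment, treated, responded in records:
        # For each segment keep [customers, responders] per treatment flag.
        group = counts.setdefault(segment, {True: [0, 0], False: [0, 0]})
        group[treated][0] += 1
        group[treated][1] += int(responded)

    uplift = {}
    for segment, group in counts.items():
        (n_treated, r_treated), (n_control, r_control) = group[True], group[False]
        if n_treated == 0 or n_control == 0:
            continue  # cannot compare without both groups
        uplift[segment] = r_treated / n_treated - r_control / n_control
    return uplift


def make(segment, treated, n, responders):
    """Build n hypothetical records, the first `responders` of which respond."""
    return [(segment, treated, i < responders) for i in range(n)]


records = (
    make("loyal", True, 10, 8) + make("loyal", False, 10, 8)
    + make("persuadable", True, 10, 6) + make("persuadable", False, 10, 1)
)

uplift = uplift_by_segment(records)
best_segment = max(uplift, key=uplift.get)
print(uplift, "-> target:", best_segment)
```

In this invented data the "loyal" segment responds at the same rate with or without an offer, so sending offers there gains nothing, while the "persuadable" segment shows a large increase and is the better target.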

9.5.3 Science and engineering 

In recent years, data mining has been used widely in the
areas of science and engineering, such as bioinformatics,
genetics, medicine, education and electrical power
engineering.

• In the study of human genetics, sequence mining
helps address the important goal of understanding
the mapping relationship between the inter-individual
variations in human DNA sequence and
the variability in disease susceptibility. In simple
terms, it aims to find out how the changes in an
individual’s DNA sequence affect the risks of
developing common diseases such as cancer, which is
of great importance to improving methods of diagnosing,
preventing, and treating these diseases. One
data mining method that is used to perform this task
is known as multifactor dimensionality reduction.[35]

• In the area of electrical power engineering, data
mining methods have been widely used for condition
monitoring of high voltage electrical equipment.
The purpose of condition monitoring is to obtain
valuable information on, for example, the status of
the insulation (or other important safety-related
parameters). Data clustering techniques, such as the
self-organizing map (SOM), have been applied to
vibration monitoring and analysis of transformer
on-load tap-changers (OLTCs). Using vibration
monitoring, it can be observed that each tap change
operation generates a signal that contains information
about the condition of the tap changer contacts
and the drive mechanisms. Obviously, different tap
positions will generate different signals. However,
there was considerable variability amongst normal
condition signals for exactly the same tap position.
SOM has been applied to detect abnormal conditions
and to hypothesize about the nature of the
abnormalities.[36]

• Data mining methods have been applied to dissolved
gas analysis (DGA) in power transformers. DGA, as
a diagnostic for power transformers, has been available
for many years. Methods such as SOM have been
applied to analyze generated data and to determine
trends which are not obvious to the standard DGA
ratio methods (such as Duval Triangle).[36]

• In educational research, data mining has
been used to study the factors leading students to
choose to engage in behaviors which reduce their
learning,[37] and to understand factors influencing
university student retention.[38] A similar example
of social application of data mining is its use
in expertise finding systems, whereby descriptors
of human expertise are extracted, normalized, and
classified so as to facilitate the finding of experts,
particularly in scientific and technical fields. In this
way, data mining can facilitate institutional memory.

• Data mining methods have also been applied to
biomedical data, facilitated by domain ontologies, [39]
to clinical trial data, [40] and to traffic analysis
using SOM. [41]

• In adverse drug reaction surveillance, the Uppsala 
Monitoring Centre has, since 1998, used data min¬ 
ing methods to routinely screen for reporting pat¬ 
terns indicative of emerging drug safety issues in 
the WHO global database of 4.6 million suspected 
adverse drug reaction incidents. [42] Recently,
similar methodology has been developed to mine large
collections of electronic health records for
temporal patterns associating drug prescriptions to
medical diagnoses. [43]

• Data mining has been applied to software artifacts 
within the realm of software engineering: Mining 
Software Repositories. 

9.5.4 Human rights 

Data mining of government records - particularly records 
of the justice system (i.e., courts, prisons) - enables the 
discovery of systemic human rights violations in
connection with the generation and publication of invalid
or fraudulent legal records by various government
agencies. [44][45]

9.5.5 Medical data mining 

Some machine learning algorithms can be applied in the
medical field as second-opinion diagnostic tools and as
tools for the knowledge extraction phase in the process
of knowledge discovery in databases. One of these
classifiers, called the Prototype Exemplar Learning
Classifier (PEL-C), [46] is able to discover syndromes
as well as atypical clinical cases.


In 2011, the Supreme Court of the United States ruled in
Sorrell v. IMS Health, Inc. that pharmacies may share
information with outside companies. This practice was
authorized under the 1st Amendment of the Constitution,
protecting the “freedom of speech.” [47] However, the
passage of the Health Information Technology for
Economic and Clinical Health Act (HITECH Act) helped to
initiate the adoption of the electronic health record
(EHR) and supporting technology in the United
States. [48] The HITECH Act was signed into law on
February 17, 2009 as part of the American Recovery and
Reinvestment Act (ARRA) and helped to open the door to
medical data mining. [49] Prior to the signing of this
law, only an estimated 20% of United States-based
physicians were using electronic patient records. [48]
Søren Brunak notes that “the patient record becomes as
information-rich as possible” and thereby “maximizes the
data mining opportunities.” [48] Hence, electronic
patient records further expand the possibilities for
medical data mining, opening the door to a vast source
of medical data for analysis.

9.5.6 Spatial data mining 

Spatial data mining is the application of data mining 
methods to spatial data. The end objective of spatial data 
mining is to find patterns in data with respect to geog¬ 
raphy. So far, data mining and Geographic Information 
Systems (GIS) have existed as two separate technologies, 
each with its own methods, traditions, and approaches to 
visualization and data analysis. Particularly, most con¬ 
temporary GIS have only very basic spatial analysis func¬ 
tionality. The immense explosion in geographically ref¬ 
erenced data occasioned by developments in IT, digital 
mapping, remote sensing, and the global diffusion of GIS 
emphasizes the importance of developing data-driven in¬ 
ductive approaches to geographical analysis and model¬ 
ing. 

Data mining offers great potential benefits for GIS-based 
applied decision-making. Recently, the task of integrat¬ 
ing these two technologies has become of critical impor¬ 
tance, especially as various public and private sector or¬ 
ganizations possessing huge databases with thematic and 
geographically referenced data begin to realize the huge 
potential of the information contained therein. Among 
those organizations are: 

• offices requiring analysis or dissemination of geo- 
referenced statistical data 

• public health services searching for explanations of 
disease clustering 

• environmental agencies assessing the impact of 
changing land-use patterns on climate change 

• geo-marketing companies doing customer segmen¬ 
tation based on spatial location. 

Challenges in spatial mining: Geospatial data
repositories tend to be very large. Moreover, existing
GIS datasets are often splintered into feature and
attribute components that are conventionally archived in
hybrid data management systems. Algorithmic requirements
differ substantially for relational (attribute) data
management and for topological (feature) data
management. [50] Related to this is the range and
diversity of geographic data formats, which present
unique challenges. The digital geographic data
revolution is creating new types of data formats beyond
the traditional “vector” and “raster” formats.
Geographic data repositories increasingly include
ill-structured data, such as imagery and geo-referenced
multi-media. [51]

There are several critical research challenges in
geographic knowledge discovery and data mining. Miller
and Han [52] offer the following list of emerging
research topics in the field:

• Developing and supporting geographic data 
warehouses (GDWs): Spatial properties are often
reduced to simple aspatial attributes in mainstream 
data warehouses. Creating an integrated GDW re¬ 
quires solving issues of spatial and temporal data in¬ 
teroperability - including differences in semantics, 
referencing systems, geometry, accuracy, and posi¬ 
tion. 

• Better spatio-temporal representations in geo¬ 
graphic knowledge discovery: Current geographic 
knowledge discovery (GKD) methods generally use 
very simple representations of geographic objects 
and spatial relationships. Geographic data mining
methods should recognize more complex geographic
objects (e.g., lines and polygons) and relationships
(e.g., non-Euclidean distances, direction,
connectivity, and interaction through attributed
geographic space such as terrain). Furthermore, the
time dimension needs to be more fully integrated 
into these geographic representations and relation¬ 
ships. 

• Geographic knowledge discovery using diverse 
data types: GKD methods should be developed 
that can handle diverse data types beyond the tradi¬ 
tional raster and vector models, including imagery 
and geo-referenced multimedia, as well as dynamic 
data types (video streams, animation). 

9.5.7 Temporal data mining 

Data may contain attributes generated and recorded at 
different times. In this case finding meaningful relation¬ 
ships in the data may require considering the temporal 
order of the attributes. A temporal relationship may in¬ 
dicate a causal relationship, or simply an association. 


9.5.8 Sensor data mining 

Wireless sensor networks can be used for facilitating the 
collection of data for spatial data mining for a variety of 
applications such as air pollution monitoring. [53] A
characteristic of such networks is that nearby sensor
nodes monitoring an environmental feature typically
register similar values. This kind of data redundancy,
due to the spatial correlation between sensor
observations, inspires techniques for in-network data
aggregation and mining. By measuring the spatial
correlation between data sampled by different sensors, a
wide class of specialized algorithms can be developed
for more efficient spatial data mining. [54]
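
As a rough illustration of measuring the correlation between readings from
different sensors, the sketch below computes the Pearson correlation
coefficient between pairs of series. The sensor names and readings are
hypothetical, and Pearson correlation is only one possible measure; the text
does not say which measure the cited algorithms use.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical pollution readings taken at the same five time steps.
sensor_a = [10.0, 12.0, 15.0, 11.0, 9.0]   # reference node
sensor_b = [10.5, 12.5, 15.5, 11.5, 9.5]   # assumed to be a close neighbour
sensor_c = [15.0, 9.0, 11.0, 10.0, 12.0]   # assumed to be far away

print(f"a vs b: {pearson(sensor_a, sensor_b):.3f}")
print(f"a vs c: {pearson(sensor_a, sensor_c):.3f}")
```

A pair of nodes with a coefficient close to 1 carries largely redundant
information, which is the situation in which in-network aggregation could
save transmissions.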

9.5.9 Visual data mining 

In the process of turning from analog into digital,
large data sets have been generated, collected, and
stored, from which statistical patterns, trends and
hidden information can be discovered in order to build
predictive patterns. Studies suggest visual data mining is
faster and much more intuitive than traditional data
mining. [55][56][57] See also Computer vision.

9.5.10 Music data mining 

Data mining techniques, and in particular co-occurrence
analysis, have been used to discover relevant similarities
among music corpora (radio lists, CD databases) for
purposes including classifying music into genres in a more
objective manner. [58]

9.5.11 Surveillance 

Data mining has been used by the U.S. government.
Programs include the Total Information Awareness (TIA)
program, Secure Flight (formerly known as the
Computer-Assisted Passenger Prescreening System (CAPPS
II)), Analysis, Dissemination, Visualization, Insight,
Semantic Enhancement (ADVISE), [59] and the Multi-state
Anti-Terrorism Information Exchange (MATRIX). [60] These
programs have been discontinued due to controversy over
whether they violate the 4th Amendment to the United
States Constitution, although many programs that were
formed under them continue to be funded by different
organizations or under different names. [61]

In the context of combating terrorism, two particularly 
plausible methods of data mining are “pattern mining” 
and “subject-based data mining”. 

9.5.12 Pattern mining 

“Pattern mining” is a data mining method that involves 
finding existing patterns in data. In this context,
“patterns” often means association rules. The original
motivation for searching association rules came from the
desire to analyze supermarket transaction data, that is,
to examine customer behavior in terms of the purchased
products. For example, an association rule “beer =>
potato chips (80%)” states that four out of five
customers that bought beer also bought potato chips.
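
The percentage attached to such a rule is usually called its confidence. A
minimal sketch of how it could be computed is shown below; the basket data
and the function names are made up for illustration and are not taken from
any particular library.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of all transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(
    transactions: List[FrozenSet[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Of the transactions containing `antecedent`, the share that also
    contain `consequent`."""
    left = frozenset(antecedent)
    both = left | frozenset(consequent)
    with_left = [t for t in transactions if left <= t]
    if not with_left:
        raise ValueError("antecedent never occurs in the transactions")
    return sum(1 for t in with_left if both <= t) / len(with_left)


# Hypothetical supermarket baskets: five contain beer, four of those
# also contain potato chips.
baskets = [
    frozenset({"beer", "potato chips"}),
    frozenset({"beer", "potato chips", "bread"}),
    frozenset({"beer", "potato chips"}),
    frozenset({"beer", "potato chips", "milk"}),
    frozenset({"beer", "milk"}),
    frozenset({"bread", "milk"}),
]

rule_confidence = confidence(baskets, {"beer"}, {"potato chips"})
print(f"beer => potato chips ({rule_confidence:.0%})")
```

Practical association rule miners typically also require a minimum support
so that rules based on very few transactions are discarded.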

In the context of pattern mining as a tool to identify 
terrorist activity, the National Research Council pro¬ 
vides the following definition: “Pattern-based data min¬ 
ing looks for patterns (including anomalous data patterns) 
that might be associated with terrorist activity — these 
patterns might be regarded as small signals in a large 
ocean of noise.” [62][63][64] Pattern mining includes new
areas such as Music Information Retrieval (MIR), where
patterns seen in both the temporal and non-temporal
domains are imported to classical knowledge discovery
search methods.

9.5.13 Subject-based data mining 

“Subject-based data mining” is a data mining method 
involving the search for associations between individu¬ 
als in data. In the context of combating terrorism, the 
National Research Council provides the following defi¬ 
nition: “Subject-based data mining uses an initiating in¬ 
dividual or other datum that is considered, based on other 
information, to be of high interest, and the goal is to de¬ 
termine what other persons or financial transactions or 
movements, etc., are related to that initiating datum.” [63]

9.5.14 Knowledge grid 

Knowledge discovery “On the Grid” generally refers to 
conducting knowledge discovery in an open environment 
using grid computing concepts, allowing users to
integrate data from various online data sources, as well
as make use of remote resources, for executing their data
mining tasks. The earliest example was the Discovery
Net, [65][66]
developed at Imperial College London, which won the 
“Most Innovative Data-Intensive Application Award” at 
the ACM SC02 (Supercomputing 2002) conference and 
exhibition, based on a demonstration of a fully interactive
distributed knowledge discovery application for
bioinformatics. Other examples include work conducted by
researchers at the University of Calabria, who
developed a Knowledge Grid architecture for distributed
knowledge discovery, based on grid computing. [67][68]

9.6 Privacy concerns and ethics 

While the term “data mining” itself has no ethical
implications, it is often associated with the mining of
information in relation to people’s behavior (ethical and
otherwise). [69]


The ways in which data mining can be used can in some
cases and contexts raise questions regarding privacy,
legality, and ethics. [70] In particular, data mining
government or commercial data sets for national security
or law enforcement purposes, such as in the Total
Information Awareness Program or in ADVISE, has raised
privacy concerns. [71][72]

Data mining requires data preparation which can uncover 
information or patterns which may compromise confiden¬ 
tiality and privacy obligations. A common way for this 
to occur is through data aggregation. Data aggregation 
involves combining data together (possibly from various 
sources) in a way that facilitates analysis (but that also 
might make identification of private, individual-level data 
deducible or otherwise apparent). [73] This is not data
mining per se, but a result of the preparation of data
before, and for the purposes of, the analysis. The threat
to an individual’s privacy comes into play when the data,
once compiled, cause the data miner, or anyone who has
access to the newly compiled data set, to be able to
identify specific individuals, especially when the data
were originally anonymous. [74][75][76]

It is recommended that an individual is made aware of the 
following before data are collected: [73]

• the purpose of the data collection and any (known) 
data mining projects; 

• how the data will be used; 

• who will be able to mine the data and use the data 
and their derivatives; 

• the status of security surrounding access to the data; 

• how collected data can be updated. 

Data may also be modified so as to become anonymous,
so that individuals may not readily be identified. [73]
However, even “de-identified”/“anonymized” data sets can
potentially contain enough information to allow
identification of individuals, as occurred when
journalists were able to find several individuals based
on a set of search histories that were inadvertently
released by AOL. [77]
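
One simple way to see why removing names is not always enough is to count
how many records share each combination of the remaining attributes. The
sketch below does this for a made-up table; the field names and values are
hypothetical, and this kind of group-size check is offered only as an
illustration, not as the method used in the cases cited above.

```python
from collections import Counter
from typing import Dict, List, Sequence

Record = Dict[str, object]


def group_sizes(records: List[Record], attributes: Sequence[str]) -> Counter:
    """Count how many records share each combination of `attributes`."""
    return Counter(tuple(r[a] for a in attributes) for r in records)


def unique_records(records: List[Record], attributes: Sequence[str]) -> List[Record]:
    """Records whose combination of `attributes` occurs exactly once,
    i.e. the ones most exposed to re-identification."""
    sizes = group_sizes(records, attributes)
    return [r for r in records if sizes[tuple(r[a] for a in attributes)] == 1]


# Hypothetical "anonymized" records: names removed, other fields kept.
records = [
    {"zip": "02138", "birth_year": 1970, "sex": "F", "diagnosis": "A"},
    {"zip": "02138", "birth_year": 1970, "sex": "F", "diagnosis": "B"},
    {"zip": "02139", "birth_year": 1985, "sex": "M", "diagnosis": "C"},
    {"zip": "02139", "birth_year": 1990, "sex": "M", "diagnosis": "D"},
]

exposed = unique_records(records, ["zip", "birth_year", "sex"])
print(f"{len(exposed)} of {len(records)} records are unique on these fields")
```

Anyone who can link those fields to an outside data source could tie a
unique record back to a person, which is why aggregation of several sources
increases the risk.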

9.6.1 Situation in Europe 

Europe has rather strong privacy laws, and efforts are un¬ 
derway to further strengthen the rights of the consumers. 
However, the U.S.-E.U. Safe Harbor Principles currently 
effectively expose European users to privacy exploitation 
by U.S. companies. As a consequence of Edward Snowden's
global surveillance disclosures, there has been
increased discussion about revoking this agreement, as in
particular the data will be fully exposed to the National
Security Agency, and attempts to reach an agreement have
failed.

9.6.2 Situation in the United States

In the United States, privacy concerns have been
addressed by the US Congress via the passage of
regulatory controls such as the Health Insurance
Portability and Accountability Act (HIPAA). HIPAA
requires individuals to give their “informed consent”
regarding information they provide and its intended
present and future uses. According to an article in
Biotech Business Week, “‘[i]n practice, HIPAA may not
offer any greater protection than the longstanding
regulations in the research arena,’ says the AAHC. More
importantly, the rule’s goal of protection through
informed consent is undermined by the complexity of
consent forms that are required of patients and
participants, which approach a level of
incomprehensibility to average individuals.” This
underscores the necessity for data anonymity in data
aggregation and mining practices.

U.S. information privacy legislation such as HIPAA and 
the Family Educational Rights and Privacy Act (FERPA) 
applies only to the specific areas that each such law ad¬ 
dresses. Use of data mining by the majority of businesses 
in the U.S. is not controlled by any legislation. 


9.7 Copyright Law 

9.7.1 Situation in Europe 

Due to a lack of flexibilities in European copyright and
database law, the mining of in-copyright works (such
as web mining) without the permission of the copyright
owner is not legal. Where a database is pure data in
Europe there is likely to be no copyright, but database
rights may exist, so data mining becomes subject to
regulations by the Database Directive. On the
recommendation of the Hargreaves review, the UK
government amended its copyright law in 2014 [79] to
allow content mining as a limitation and exception. The
UK was only the second country in the world to do so,
after Japan, which introduced an exception for data
mining in 2009. However, due to the restriction of the
Copyright Directive, the UK exception only allows
content mining for non-commercial purposes. UK copyright
law also does not allow this provision to be overridden
by contractual terms and conditions. The European
Commission facilitated stakeholder discussion on text
and data mining in 2013, under the title of Licences for
Europe. [80] The focus on licences, rather than
limitations and exceptions, as the solution to this
legal issue led representatives of universities,
researchers, libraries, civil society groups and open
access publishers to leave the stakeholder dialogue in
May 2013. [81]

9.7.2 Situation in the United States 

By contrast to Europe, the flexible nature of US
copyright law, and in particular fair use, means that
content mining in America, as well as in other fair use
countries such as Israel, Taiwan and South Korea, is
viewed as being legal. As content mining is
transformative, that is, it does not supplant the
original work, it is viewed as being lawful under fair
use. For example, as part of the Google Book settlement
the presiding judge on the case ruled that Google’s
digitisation project of in-copyright books was lawful,
in part because of the transformative uses that the
digitisation project displayed, one being text and data
mining. [82]


9.8 Software 

See also: Category:Data mining and machine learning
software. 


9.8.1 Free open-source data mining software and applications

• Carrot2: Text and search results clustering frame¬ 
work. 

• Chemicalize.org: A chemical structure miner and 
web search engine. 

• ELKI: A university research project with advanced 
cluster analysis and outlier detection methods writ¬ 
ten in the Java language. 

• GATE: a natural language processing and language 
engineering tool. 

• KNIME: The Konstanz Information Miner, a user 
friendly and comprehensive data analytics frame¬ 
work. 

• ML-Flex: A software package that enables users 
to integrate with third-party machine-learning pack¬ 
ages written in any programming language, exe¬ 
cute classification analyses in parallel across multi¬ 
ple computing nodes, and produce HTML reports 
of classification results. 

• MLPACK library: a collection of ready-to-use ma¬ 
chine learning algorithms written in the C++ lan¬ 
guage. 

• Massive Online Analysis (MOA): a tool for real-time
big data stream mining with concept drift, written in
the Java programming language.

• NLTK (Natural Language Toolkit): A suite of li¬ 
braries and programs for symbolic and statistical 
natural language processing (NLP) for the Python 
language. 

• OpenNN: Open neural networks library. 

• Orange: A component-based data mining and
machine learning software suite written in the 
Python language. 

• R: A programming language and software environ¬ 
ment for statistical computing, data mining, and 
graphics. It is part of the GNU Project. 

• SCaViS: Java cross-platform data analysis frame¬ 
work developed at Argonne National Laboratory. 

• SenticNet API: A semantic and affective resource 
for opinion mining and sentiment analysis. 

• Tanagra: A visualisation-oriented data mining soft¬ 
ware, also for teaching. 

• Torch: An open source deep learning library for the 
Lua programming language and scientific comput¬ 
ing framework with wide support for machine learn¬ 
ing algorithms. 

• UIMA: The UIMA (Unstructured Information 
Management Architecture) is a component frame¬ 
work for analyzing unstructured content such as text, 
audio and video - originally developed by IBM. 

• Weka: A suite of machine learning software appli¬ 
cations written in the Java programming language. 

9.8.2 Commercial data-mining software 
and applications 

• Angoss KnowledgeSTUDIO: data mining tool pro¬ 
vided by Angoss. 

• Clarabridge: enterprise class text analytics solution. 

• HP Vertica Analytics Platform: data mining soft¬ 
ware provided by HP. 

• IBM SPSS Modeler: data mining software provided 
by IBM. 

• KXEN Modeler: data mining tool provided by 
KXEN. 

• Grapheme: data mining and visualization software 
provided by iChrome. 

• LIONsolver: an integrated software application for 
data mining, business intelligence, and modeling 
that implements the Learning and Intelligent Opti- 
mizatioN (LION) approach. 

• Microsoft Analysis Services: data mining software 
provided by Microsoft. 

• NetOwl: suite of multilingual text and entity analyt¬ 
ics products that enable data mining. 

• Oracle Data Mining: data mining software by 
Oracle. 


• RapidMiner: An environment for machine learning 
and data mining experiments. 

• SAS Enterprise Miner: data mining software pro¬ 
vided by the SAS Institute. 

• STATISTICA Data Miner: data mining software 
provided by StatSoft. 

• Qlucore Omics Explorer: data mining software pro¬ 
vided by Qlucore. 

9.8.3 Marketplace surveys 

Several researchers and organizations have conducted re¬ 
views of data mining tools and surveys of data miners. 
These identify some of the strengths and weaknesses of 
the software packages. They also provide an overview 
of the behaviors, preferences and views of data miners. 
Some of these reports include: 

• 2011 Wiley Interdisciplinary Reviews: Data Mining
and Knowledge Discovery [83]

• Rexer Analytics Data Miner Surveys (2007-2013) [84]

• Forrester Research 2010 Predictive Analytics and
Data Mining Solutions report [85]

• Gartner 2008 “Magic Quadrant” report [86]

• Robert A. Nisbet’s 2006 Three Part Series of
articles “Data Mining Tools: Which One is Best For
CRM?” [87]

• Haughton et al.'s 2003 Review of Data Mining
Software Packages in The American Statistician [88]

• Goebel & Gruenwald 1999 “A Survey of Data
Mining and Knowledge Discovery Software Tools” in
SIGKDD Explorations [89]


9.9 See also 

Methods 

• Anomaly/outlier/change detection 

• Association rule learning 

• Classification 

• Cluster analysis 

• Decision tree 

• Factor analysis 

• Genetic algorithms 

• Intention mining

• Multilinear subspace learning 

• Neural networks 

• Regression analysis 

• Sequence mining 

• Structured data analysis 

• Support vector machines 

• Text mining 

• Online analytical processing (OLAP) 

Application domains 

• Analytics 

• Bioinformatics 

• Business intelligence 

• Data analysis 

• Data warehouse 

• Decision support system 

• Drug discovery 

• Exploratory data analysis 

• Predictive analytics 

• Web mining 

Application examples 

See also: Category:Applied data mining. 

• Customer analytics 

• Data mining in agriculture 

• Data mining in meteorology 

• Educational data mining 

• National Security Agency 

• Police-enforced ANPR in the UK 

• Quantitative structure-activity relationship 

• Surveillance / Mass surveillance (e.g., Stellar Wind)

Related topics 

Data mining is about analyzing data; for information 
about extracting information out of data, see: 


• Data integration 

• Data transformation 

• Electronic discovery 

• Information extraction 

• Information integration 

• Named-entity recognition 

• Profiling (information science) 

• Web scraping 

Chapter 10

Big data 


This article is about large collections of data. For the 
graph database, see Graph database. For the band, see 
Big Data (band). 

Big data is a broad term for data sets so large or
complex that traditional data processing applications are
inadequate. Challenges include analysis, capture, data
curation, search, sharing, storage, transfer,
visualization, and information privacy. The term often
refers simply to the use of predictive analytics or
certain other advanced methods to extract value from
data, and seldom to a particular size of data set.
Accuracy in big data may lead to more confident decision
making, and better decisions can mean greater
operational efficiency, cost reductions and reduced
risk.

Visualization of daily Wikipedia edits created by IBM. At multiple
terabytes in size, the text and images of Wikipedia are an example
of big data.

Growth of and Digitization of Global Information Storage Capacity

Analysis of data sets can find new correlations, to “spot 
business trends, prevent diseases, combat crime and so 
on.” [1] Scientists, business executives, practitioners of
media and advertising and governments alike regularly
meet difficulties with large data sets in areas including
Internet search, finance and business informatics.
Scientists encounter limitations in e-Science work,
including meteorology, genomics, [2] connectomics, complex
physics simulations, [3] and biological and environmental
research. [4]

Data sets grow in size in part because they are
increasingly being gathered by cheap and numerous
information-sensing mobile devices, aerial (remote
sensing), software logs, cameras, microphones,
radio-frequency identification (RFID) readers, and
wireless sensor networks. [5][6][7] The world’s
technological per-capita capacity to store information
has roughly doubled every 40 months since the 1980s; [8]
as of 2012, every day 2.5 exabytes (2.5×10^18 bytes) of
data were created. [9] The challenge for large
enterprises is determining who should own big data
initiatives that straddle the entire organization. [10]
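
To get a feel for what doubling every 40 months implies, the short sketch
below converts a doubling period into a growth factor over an arbitrary
span. It assumes smooth exponential growth, which is a simplification of
the "roughly doubled" figure quoted above; the function name is chosen here
for illustration.

```python
def growth_factor(months: float, doubling_period_months: float = 40.0) -> float:
    """Factor by which a quantity grows over `months` if it doubles
    every `doubling_period_months` (smooth exponential growth assumed)."""
    if doubling_period_months <= 0:
        raise ValueError("doubling period must be positive")
    return 2.0 ** (months / doubling_period_months)


yearly = growth_factor(12)        # growth over one year
decade = growth_factor(120)       # growth over ten years (three doublings)

print(f"per year: x{yearly:.2f}")
print(f"per decade: x{decade:.0f}")
```

Under this assumption the capacity grows by roughly 23% per year and
eightfold per decade.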

Work with big data is necessarily uncommon; most
analysis is of “PC size” data, on a desktop PC or
notebook [11] that can handle the available data set.

Relational database management systems and desktop 
statistics and visualization packages often have difficulty 
handling big data. The work instead requires “massively 
parallel software running on tens, hundreds, or even
thousands of servers”. [12] What is considered “big data”
varies depending on the capabilities of the users and
their tools, and expanding capabilities make Big Data a
moving target. Thus, what is considered to be “Big” in one
year will become ordinary in later years. “For some
organizations, facing hundreds of gigabytes of data for
the first time may trigger a need to reconsider data
management options. For others, it may take tens or
hundreds of terabytes before data size becomes a
significant consideration.” [13]


10.1 Definition 

Big data usually includes data sets with sizes beyond 
the ability of commonly used software tools to capture, 
curate, manage, and process data within a tolerable 
elapsed time. [14] Big data “size” is a constantly moving
target, as of 2012 ranging from a few dozen terabytes to
many petabytes of data. Big data is a set of techniques
and technologies that require new forms of integration to
uncover large hidden values from large datasets that are
diverse, complex, and of a massive scale. [15]

In a 2001 research report [16] and related lectures, META
Group (now Gartner) analyst Doug Laney defined data
growth challenges and opportunities as being
three-dimensional, i.e. increasing volume (amount of data),
velocity (speed of data in and out), and variety (range of
data types and sources). Gartner, and now much of the
industry, continue to use this “3Vs” model for describing
big data. [17] In 2012, Gartner updated its definition as
follows: “Big data is high volume, high velocity, and/or
high variety information assets that require new forms of
processing to enable enhanced decision making, insight
discovery and process optimization.” [18] Additionally, a
new V, “Veracity”, is added by some organizations to
describe it. [19]

While Gartner’s definition (the 3Vs) is still widely
used, the growing maturity of the concept fosters a
sounder distinction between big data and Business
Intelligence, regarding data and their use: [20]

• Business Intelligence uses descriptive statistics with 
data with high information density to measure 
things, detect trends etc.; 

• Big data uses inductive statistics and concepts from
nonlinear system identification [21] to infer laws
(regressions, nonlinear relationships, and causal
effects) from large sets of data with low information
density [22] to reveal relationships and dependencies
and to perform predictions of outcomes and
behaviors. [21][23]

A more recent, consensual definition states that “Big Data 
represents the Information assets characterized by such 
a High Volume, Velocity and Variety to require specific 
Technology and Analytical Methods for its transformation
into Value”. [24]


10.2 Characteristics 

Big data can be described by the following characteristics: 

Volume - The quantity of data that is generated is very
important in this context. It is the size of the data which
determines the value and potential of the data under
consideration and whether it can actually be considered
Big Data or not. The name ‘Big Data’ itself contains a term
which is related to size and hence the characteristic.

Variety - The next aspect of Big Data is its variety. This
means that the category to which Big Data belongs is
also an essential fact that needs to be known by the data
analysts. This helps the people who are closely analyzing
the data and are associated with it to use the data
effectively to their advantage, thus upholding the
importance of the Big Data.

Velocity - The term ‘velocity’ in the context refers to the 
speed of generation of data or how fast the data is gen¬ 
erated and processed to meet the demands and the chal¬ 
lenges which lie ahead in the path of growth and devel¬ 
opment. 

Variability - This is a factor which can be a problem for 
those who analyse the data. This refers to the inconsis¬ 
tency which can be shown by the data at times, thus ham¬ 
pering the process of being able to handle and manage 
the data effectively. 

Veracity - The quality of the data being captured can vary 
greatly. Accuracy of analysis depends on the veracity of 
the source data. 

Complexity - Data management can become a very complex
process, especially when large volumes of data come
from multiple sources. These data need to be linked,
connected and correlated in order to be able to grasp the
information that is supposed to be conveyed by these data.
This situation is therefore termed the ‘complexity’ of
Big Data.

Factory work and Cyber-physical systems may have a 6C 
system: 

1. Connection (sensor and networks), 

2. Cloud (computing and data on demand), 

3. Cyber (model and memory), 

4. Content/context (meaning and correlation),

5. Community (sharing and collaboration), and

6. Customization (personalization and value).

In this scenario, in order to provide useful insight to
the factory management and gain correct content, data
has to be processed with advanced tools (analytics and
algorithms) to generate meaningful information.
Considering the presence of visible and invisible issues
in an industrial factory, the information generation
algorithm has to be capable of detecting and addressing
invisible issues such as machine degradation, component
wear, etc. on the factory floor. [25][26]



10.3 Architecture 

In 2000, Seisint Inc. developed a C++-based distributed
file sharing framework for data storage and querying.
Structured, semi-structured and/or unstructured data is
stored and distributed across multiple servers. Querying
of data is done in a modified C++ called ECL, which
applies a schema-on-read method to create the structure
of stored data at the time of the query. In 2004
LexisNexis acquired Seisint Inc. [27] and in 2008
acquired ChoicePoint, Inc. [28] and their high speed
parallel processing platform. The two platforms were
merged into HPCC Systems, which in 2011 was open sourced
under the Apache v2.0 License. Currently HPCC and
Quantcast File System [29] are the only publicly
available platforms capable of analyzing multiple
exabytes of data.

In 2004, Google published a paper on a process called
MapReduce that used such an architecture. The MapReduce
framework provides a parallel processing model and
associated implementation to process huge amounts of
data. With MapReduce, queries are split and distributed
across parallel nodes and processed in parallel (the Map
step). The results are then gathered and delivered (the
Reduce step). The framework was very successful,[30]
so others wanted to replicate the algorithm. An
implementation of the MapReduce framework was therefore
adopted by an Apache open source project named
Hadoop.[31]
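
The Map and Reduce steps described above can be illustrated with a minimal single-process sketch. It is not Google's or Hadoop's implementation: the function names (map_step, shuffle, reduce_step, word_count) and the word-count task are assumptions chosen for illustration, and the chunks are processed one after another here, whereas a real framework would run them on separate nodes in parallel.

```python
from collections import defaultdict

def map_step(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_chunks):
    """Group the emitted values by key between the Map and Reduce steps."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_step(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)

def word_count(chunks):
    # Each chunk stands in for the slice of data held by one node.
    mapped = [map_step(chunk) for chunk in chunks]
    grouped = shuffle(mapped)
    return dict(reduce_step(k, v) for k, v in grouped.items())

# Hypothetical input: three chunks, as if stored on three nodes.
chunks = ["big data big", "data lake", "big lake"]
counts = word_count(chunks)
print(counts)
```

Because each call to map_step depends only on its own chunk, the Map work can be spread across machines; only the grouping step needs to bring values for the same key together.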

MIKE2.0 is an open approach to information management
that acknowledges the need for revisions due to big data
implications in an article titled “Big Data Solution
Offering”.[32] The methodology addresses handling
big data in terms of useful permutations of data sources,
complexity in interrelationships, and difficulty in deleting
(or modifying) individual records.[33]

Recent studies show that a multiple-layer architecture
is an option for dealing with big data. The Distributed
Parallel architecture distributes data across multiple
processing units, and parallel processing delivers
results much faster by improving processing speeds.
This type of architecture inserts data into a parallel
DBMS, which implements the use of MapReduce and
Hadoop frameworks. This type of framework looks to
make the processing power transparent to the end user by
using a front-end application server.[34]

Big Data Analytics for Manufacturing Applications can
be based on a 5C architecture (connection, conversion,
cyber, cognition, and configuration).[35]

Big Data Lake - With the changing face of business and
the IT sector, the capture and storage of data has evolved
into a sophisticated system. The big data lake allows an
organization to shift its focus from centralized control to
a shared model to respond to the changing dynamics of
information management. This enables quick segregation
of data into the data lake, thereby reducing overhead
time.[36]


10.4 Technologies 

Big data requires exceptional technologies to efficiently
process large quantities of data within tolerable elapsed
times. A 2011 McKinsey report[37] suggests suitable
technologies include A/B testing, crowdsourcing,
data fusion and integration, genetic algorithms, machine
learning, natural language processing, signal processing,
simulation, time series analysis and visualisation.
Multidimensional big data can also be represented as
tensors, which can be more efficiently handled by
tensor-based computation,[38] such as multilinear subspace
learning.[39] Additional technologies being applied to big
data include massively parallel-processing (MPP) databases,
search-based applications, data mining, distributed file
systems, distributed databases, cloud-based infrastructure
(applications, storage and computing resources) and the
Internet.

Some but not all MPP relational databases have the ability
to store and manage petabytes of data. Implicit is the
ability to load, monitor, back up, and optimize the use of
the large data tables in the RDBMS.[40]

DARPA’s Topological Data Analysis program seeks the
fundamental structure of massive data sets, and in 2008
the technology went public with the launch of a company
called Ayasdi.[41]

Practitioners of big data analytics are generally hostile
to slower shared storage,[42] preferring direct-attached
storage (DAS) in its various forms, from solid state drive
(SSD) to high-capacity SATA disk buried inside parallel
processing nodes. The perception of shared storage
architectures, namely storage area network (SAN) and
network-attached storage (NAS), is that they are
relatively slow, complex, and expensive. These qualities
are not consistent with big data analytics systems that
thrive on system performance, commodity infrastructure,
and low cost.

Real or near-real time information delivery is one of the
defining characteristics of big data analytics. Latency is
therefore avoided whenever and wherever possible. Data
in memory is good; data on spinning disk at the other
end of an FC SAN connection is not. The cost of a SAN
at the scale needed for analytics applications is very much
higher than that of other storage techniques.

There are advantages as well as disadvantages to shared
storage in big data analytics, but big data analytics
practitioners as of 2011 did not favour it.[43]


10.5 Applications 

Big data has increased the demand for information
management specialists: Software AG, Oracle Corporation,
IBM, Microsoft, SAP, EMC, HP and Dell have
spent more than $15 billion on software firms specializing
in data management and analytics. In 2010, this industry
was worth more than $100 billion and was growing at
almost 10 percent a year: about twice as fast as the
software business as a whole.[1]

Developed economies make increasing use of data-intensive
technologies. There are 4.6 billion mobile-phone
subscriptions worldwide and between 1 billion and
2 billion people accessing the internet.[1] Between 1990
and 2005, more than 1 billion people worldwide entered
the middle class, which means more people with rising
incomes become more literate, which in turn leads
to information growth. The world’s effective capacity
to exchange information through telecommunication
networks was 281 petabytes in 1986, 471 petabytes in 1993,
2.2 exabytes in 2000 and 65 exabytes in 2007,[8] and it is
predicted that the amount of traffic flowing over the
internet will reach 667 exabytes annually by 2014.[1] It is
estimated that one third of the globally stored information
is in the form of alphanumeric text and still image data,[44]
which is the format most useful for most big data
applications. This also shows the potential of yet unused
data (i.e. in the form of video and audio content).

While many vendors offer off-the-shelf solutions for
Big Data, experts recommend the development of in-house
solutions custom-tailored to solve the company’s
problem at hand if the company has sufficient technical
capabilities.[45]


10.5.1 Government 

The use and adoption of Big Data within governmental
processes is beneficial and allows efficiencies in terms of
cost, productivity, and innovation. That said, this
process does not come without its flaws. Data analysis often
requires multiple parts of government (central and local)
to work in collaboration and create new and innovative
processes to deliver the desired outcome. Below are
leading examples within the governmental Big Data space.


United States of America 

• In 2012, the Obama administration announced the
Big Data Research and Development Initiative, to
explore how big data could be used to address
important problems faced by the government.[46] The
initiative is composed of 84 different big data
programs spread across six departments.[47]

• Big data analysis played a large role in Barack
Obama's successful 2012 re-election campaign.[48]

• The United States Federal Government owns six
of the ten most powerful supercomputers in the
world.[49]

• The Utah Data Center is a data center currently
being constructed by the United States National
Security Agency. When finished, the facility will be
able to handle a large amount of information
collected by the NSA over the Internet. The exact
amount of storage space is unknown, but more
recent sources claim it will be on the order of a few
exabytes.[50][51][52]

India 

• Big data analysis was, in part, responsible for the
BJP and its allies winning the Indian General
Election 2014.[53]

• The Indian Government utilises numerous techniques
to ascertain how the Indian electorate is responding
to government action, as well as ideas for
policy augmentation.

United Kingdom 

Examples of uses of big data in public services: 

• Data on prescription drugs: by connecting origin,
location and the time of each prescription, a research
unit was able to exemplify the considerable delay
between the release of any given drug and a UK-wide
adaptation of the National Institute for Health
and Care Excellence guidelines. This suggests that
new or most up-to-date drugs take some time to filter
through to the general patient.

• Joining up data: a local authority blended data about
services, such as road gritting rotas, with services for
people at risk, such as ‘meals on wheels’. The
connection of data allowed the local authority to avoid
any weather-related delay.

10.5.2 International development 

Research on the effective usage of information and
communication technologies for development (also known as
ICT4D) suggests that big data technology can make
important contributions but also present unique challenges
to International development. 15411531 Advancements in 
big data analysis offer cost-effective opportunities to
improve decision-making in critical development areas
such as health care, employment, economic productivity,
crime, security, and natural disaster and resource
management.[56][57][58] However, longstanding challenges
for developing regions such as inadequate technological
infrastructure and economic and human resource
scarcity exacerbate existing concerns with big data such
as privacy, imperfect methodology, and interoperability
issues.[56]


10.5.3 Manufacturing 

Based on the TCS 2013 Global Trend Study, improvements
in supply planning and product quality provide the
greatest benefit of big data for manufacturing.[59] Big data
provides an infrastructure for transparency in the
manufacturing industry, which is the ability to unravel
uncertainties such as inconsistent component performance
and availability. Predictive manufacturing, as an
applicable approach toward near-zero downtime and
transparency, requires vast amounts of data and advanced
prediction tools to systematically process data into
useful information.[60] A conceptual framework of
predictive manufacturing begins with data acquisition,
where different types of sensory data are available to
acquire, such as acoustics, vibration, pressure, current,
voltage and controller data. Vast amounts of sensory data,
in addition to historical data, constitute the big data in
manufacturing. The generated big data acts as the input
into predictive tools and preventive strategies
such as Prognostics and Health Management (PHM).[61]

Cyber-Physical Models 

Current PHM implementations mostly utilize data during
actual usage, while analytical algorithms can perform
more accurately when more information from throughout
the machine’s lifecycle, such as system configuration,
physical knowledge and working principles, is included.
There is a need to systematically integrate, manage and
analyze machinery or process data during different stages
of the machine life cycle to handle data/information more
efficiently and further achieve better transparency of
machine health condition for the manufacturing industry.

With such motivation a cyber-physical (coupled) model
scheme has been developed (see
http://www.imscenter.net/cyber-physical-platform). The
coupled model is a digital twin of the real machine that
operates in the cloud platform and simulates the health
condition with integrated knowledge from both data-driven
analytical algorithms and other available physical
knowledge. It can also be described as a 5S systematic
approach consisting of Sensing, Storage, Synchronization,
Synthesis and Service. The coupled model first constructs
a digital image from the early design stage. System
information and physical knowledge are logged during
product design, based on which a simulation model is built
as a reference for future analysis. Initial parameters may
be statistically generalized, and they can be tuned by
parameter estimation using data from testing or the
manufacturing process. After that, the simulation model
can be considered a mirrored image of the real machine,
able to continuously record and track machine condition
during the later utilization stage. Finally, with
ubiquitous connectivity offered by cloud computing
technology, the coupled model also provides better
accessibility of machine condition for factory managers in
cases where physical access to actual equipment or
machine data is limited.[26][62]


10.5.4 Media 

Internet of Things (IoT) 

Main article: Internet of Things 

To understand how the media utilises Big Data, it is first
necessary to provide some context on the mechanisms
used in the media process. Nick Couldry and Joseph Turow
have suggested that practitioners in media and
advertising approach big data as many actionable
points of information about millions of individuals. The
industry appears to be moving away from the traditional
approach of using specific media environments such as
newspapers, magazines, or television shows, and instead
taps into consumers with technologies that reach targeted
people at optimal times in optimal locations. The
ultimate aim is to serve, or convey, a message or content
that is (statistically speaking) in line with the
consumer’s mindset. For example, publishing environments
are increasingly tailoring messages (advertisements) and
content (articles) to appeal to consumers, based on
information gleaned exclusively through various
data-mining activities.[63]

• Targeting of consumers (for advertising by mar¬ 
keters) 

• Data-capture 

Big Data and the IoT work in conjunction. From a media
perspective, data is the key derivative of device
interconnectivity and allows accurate targeting. The
Internet of Things, with the help of big data, therefore
transforms the media industry, companies and even
governments, opening up a new era of economic growth and
competitiveness. The intersection of people, data and
intelligent algorithms has far-reaching impacts on media
efficiency. The wealth of data generated allows an
elaborate layer on the present targeting mechanisms of
the industry.
• eBay.com uses two data warehouses at 7.5 petabytes
and 40PB as well as a 40PB Hadoop cluster for
search, consumer recommendations, and merchandising.
See “Inside eBay’s 90PB data warehouse”.

• Amazon.com handles millions of back-end operations
every day, as well as queries from more than
half a million third-party sellers. The core technology
that keeps Amazon running is Linux-based and
as of 2005 they had the world’s three largest Linux
databases, with capacities of 7.8 TB, 18.5 TB, and
24.7 TB.[64]

• Facebook handles 50 billion photos from its user
base.[65]

• As of August 2012, Google was handling roughly
100 billion searches per month.[66]

• Oracle NoSQL Database has been tested past the
1M ops/sec mark with 8 shards and proceeded to hit
1.2M ops/sec with 10 shards.[67]

10.5.5 Private sector 

Retail 

• Walmart handles more than 1 million customer
transactions every hour, which are imported into
databases estimated to contain more than 2.5
petabytes (2560 terabytes) of data, the equivalent
of 167 times the information contained in all the
books in the US Library of Congress.[1]

Retail Banking 

• FICO Card Detection System protects accounts
world-wide.[68]

• The volume of business data worldwide, across all
companies, doubles every 1.2 years, according to
estimates.[69][70]

Real Estate 

• Windermere Real Estate uses anonymous GPS signals
from nearly 100 million drivers to help new
home buyers determine their typical drive times
to and from work throughout various times of the
day.[71]

10.5.6 Science 

The Large Hadron Collider experiments represent about
150 million sensors delivering data 40 million times per
second. There are nearly 600 million collisions per
second. After filtering and refraining from recording more
than 99.99995%[72] of these streams, there are 100
collisions of interest per second.[73][74][75]

• As a result, working with less than 0.001% of
the sensor stream data, the data flow from all four
LHC experiments represents an annual rate of 25
petabytes before replication (as of 2012). This becomes
nearly 200 petabytes after replication.

• If all sensor data were to be recorded in LHC, the
data flow would be extremely hard to work with. The
data flow would exceed an annual rate of 150 million
petabytes, or nearly 500 exabytes per day, before
replication. To put the number in perspective, this is
equivalent to 500 quintillion (5×10^20) bytes per day,
almost 200 times more than all the other sources
combined in the world.
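
The unit conversions behind these figures can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, with 1 petabyte = 10^15 bytes and 1 exabyte = 10^18 bytes, and a 365-day year; neither assumption is stated in the text, and the variable names are illustrative only.

```python
# Decimal (SI) byte units; this choice is an assumption for illustration.
PETABYTE = 10**15
EXABYTE = 10**18

# 500 exabytes expressed in bytes.
bytes_per_day_quoted = 500 * EXABYTE

# 150 million petabytes per year, averaged over a 365-day year.
annual_bytes = 150_000_000 * PETABYTE
daily_exabytes = annual_bytes / 365 / EXABYTE

print(bytes_per_day_quoted == 5 * 10**20)
print(round(daily_exabytes))
```

Under these assumptions 500 exabytes is exactly 5×10^20 bytes, and 150 million petabytes a year averages out to a little over 400 exabytes a day, the same order of magnitude as the rounded daily figure quoted above.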

The Square Kilometre Array is a telescope which consists
of millions of antennas and is expected to be operational
by 2024. Collectively, these antennas are expected to
gather 14 exabytes and store one petabyte per day.[76][77]
It is considered to be one of the most ambitious scientific
projects ever undertaken.

Science and Research 

• When the Sloan Digital Sky Survey (SDSS) began
collecting astronomical data in 2000, it amassed
more in its first few weeks than all data collected
in the history of astronomy. Continuing at a rate
of about 200 GB per night, SDSS has amassed
more than 140 terabytes of information. When
the Large Synoptic Survey Telescope, successor to
SDSS, comes online in 2016, it is anticipated to
acquire that amount of data every five days.[1]

• Decoding the human genome originally took 10
years to process; now it can be achieved in less than a
day. DNA sequencers have divided the sequencing
cost by 10,000 in the last ten years, which is 100
times cheaper than the reduction in cost predicted
by Moore’s Law.[78]

• The NASA Center for Climate Simulation (NCCS)
stores 32 petabytes of climate observations and
simulations on the Discover supercomputing cluster.[79]
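
The comparison between the fall in sequencing cost and Moore’s Law in the list above can be reproduced as a rough calculation. The sketch assumes the common reading of Moore’s Law as a doubling every 18 months; the text does not say which doubling period it uses, and with a 24-month period the ratio would come out larger. The function name moore_factor is illustrative.

```python
def moore_factor(years, doubling_months):
    """Improvement factor if capability per cost doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)

sequencing_factor = 10_000          # cost reduction over ten years quoted in the text
moore_18 = moore_factor(10, 18)     # assumed 18-month doubling period
ratio = sequencing_factor / moore_18

print(round(moore_18), round(ratio))
```

With an 18-month doubling period, ten years of Moore’s Law give an improvement of roughly a hundredfold, so a 10,000-fold cost reduction is about 100 times larger, which is consistent with the statement in the text.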

10.6 Research activities 

Encrypted search and cluster formation in big data was
demonstrated in March 2014 at the American Society
of Engineering Education. Gautam Siwach, engaged in
Tackling the Challenges of Big Data by the MIT Computer
Science and Artificial Intelligence Laboratory, and Dr.
Amir Esmailpour of the UNH Research Group investigated
the key features of big data, such as the formation of
clusters and their interconnections. They focused on the
security of big data and the actual orientation of the term
towards the presence of different types of data in an
encrypted form at the cloud interface, providing raw
definitions and real-time examples within the technology.
Moreover, they proposed an approach for identifying the
encoding technique to advance towards an expedited search
over encrypted text, leading to security enhancements in
big data.[80]

In March 2012, the White House announced a national
“Big Data Initiative” that consisted of six Federal
departments and agencies committing more than $200 million
to big data research projects.[81]

The initiative included a National Science Foundation
“Expeditions in Computing” grant of $10 million over
5 years to the AMPLab[82] at the University of California,
Berkeley.[83] The AMPLab also received funds from
DARPA and over a dozen industrial sponsors, and uses
big data to attack a wide range of problems, from
predicting traffic congestion[84] to fighting cancer.[85]

The White House Big Data Initiative also included a
commitment by the Department of Energy to provide $25
million in funding over 5 years to establish the Scalable
Data Management, Analysis and Visualization (SDAV)
Institute,[86] led by the Energy Department’s Lawrence
Berkeley National Laboratory. The SDAV Institute aims
to bring together the expertise of six national
laboratories and seven universities to develop new tools
to help scientists manage and visualize data on the
Department’s supercomputers.

The U.S. state of Massachusetts announced the
Massachusetts Big Data Initiative in May 2012, which
provides funding from the state government and private
companies to a variety of research institutions.[87] The
Massachusetts Institute of Technology hosts the Intel
Science and Technology Center for Big Data in the MIT
Computer Science and Artificial Intelligence Laboratory,
combining government, corporate, and institutional
funding and research efforts.[88]

The European Commission is funding the 2-year-long Big
Data Public Private Forum through its Seventh Framework
Program to engage companies, academics and other
stakeholders in discussing big data issues. The project
aims to define a strategy in terms of research and
innovation to guide supporting actions from the European
Commission in the successful implementation of the big data
economy. Outcomes of this project will be used as input
for Horizon 2020, its next framework program.[89]

The British government announced in March 2014 the
founding of the Alan Turing Institute, named after the
computer pioneer and code-breaker, which will focus on
new ways of collecting and analysing large sets of data.[90]

At the University of Waterloo Stratford Campus Canadian
Open Data Experience (CODE) Inspiration Day,
it was demonstrated how using data visualization
techniques can increase the understanding and appeal of big
data sets in order to communicate a story to the world.[91]

To make manufacturing more competitive in the United
States (and the globe), there is a need to integrate more
American ingenuity and innovation into manufacturing.
Therefore, the National Science Foundation has awarded a
grant to the Industry/University Cooperative Research
Center for Intelligent Maintenance Systems (IMS)
at the University of Cincinnati to focus on developing
advanced predictive tools and techniques applicable
in a big data environment.[61][92] In May 2013, the IMS
Center held an industry advisory board meeting focusing on
big data, where presenters from various industrial
companies discussed their concerns, issues and future
goals in the Big Data environment.

Computational social sciences - Anyone can use
Application Programming Interfaces (APIs) provided by Big
Data holders, such as Google and Twitter, to do research
in the social and behavioral sciences.[93] Often these APIs
are provided for free. 19 ’ 1 Tobias Preis et al. used Google
Trends data to demonstrate that Internet users from
countries with a higher per capita gross domestic product
(GDP) are more likely to search for information about
the future than information about the past. The findings
suggest there may be a link between online behaviour and
real-world economic indicators.[94][95][96] The authors of
the study examined Google query logs and computed the
ratio of the volume of searches for the coming year
(‘2011’) to the volume of searches for the previous year
(‘2009’), which they call the ‘future orientation
index’.[97] They compared the future orientation index to
the per capita GDP of each country and found a strong
tendency for countries in which Google users enquire more
about the future to exhibit a higher GDP. The results hint
that there may potentially be a relationship between the
economic success of a country and the information-seeking
behavior of its citizens captured in big data.
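
The ‘future orientation index’ described above is a simple ratio, which the following sketch makes concrete. It is not the authors’ code: the function name, the country labels and the search volumes are hypothetical values invented for illustration, and no real Google Trends data or API is used.

```python
def future_orientation_index(searches_next_year, searches_previous_year):
    """Ratio of search volume for the coming year to that for the previous year."""
    if searches_previous_year <= 0:
        raise ValueError("previous-year search volume must be positive")
    return searches_next_year / searches_previous_year

# Hypothetical search volumes (coming year, previous year), not real data.
volumes = {
    "Country A": (1200, 1000),
    "Country B": (800, 1000),
}

index = {c: future_orientation_index(n, p) for c, (n, p) in volumes.items()}

# Countries ordered from most to least future-oriented.
ranked = sorted(index, key=index.get, reverse=True)
print(index)
print(ranked)
```

An index above 1 means users searched more for the coming year than for the previous one. In the study this value was then compared with each country’s per capita GDP, a step this sketch leaves out.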

Tobias Preis and his colleagues Helen Susannah Moat
and H. Eugene Stanley introduced a method to identify
online precursors for stock market moves, using
trading strategies based on search volume data
provided by Google Trends.[98] Their analysis of Google
search volume for 98 terms of varying financial
relevance, published in Scientific Reports,[99] suggests
that increases in search volume for financially relevant
search terms tend to precede large losses in financial
markets.[100][101][102][103][104][105][106][107]


10.7 Critique 

Critiques of the big data paradigm come in two flavors, 
those that question the implications of the approach itself, 
and those that question the way it is currently done. 
[Figure: Cartoon critical of big data application, by T. Gregorius. Text in the cartoon: "Your recent Amazon purchases, Tweet score and location history makes you 23.5% welcome here."]

10.7.1 Critiques of the big data paradigm 

“A crucial problem is that we do not know much about
the underlying empirical micro-processes that lead to the
emergence of the[se] typical network characteristics of
Big Data”. 1 14 In their critique, Snijders, Matzat, and
Reips point out that often very strong assumptions are
made about mathematical properties that may not at all
reflect what is really going on at the level of
micro-processes. Mark Graham has leveled broad critiques
at Chris Anderson's assertion that big data will spell the
end of theory, focusing in particular on the notion that
big data will always need to be contextualized in their
social, economic and political contexts.[108] Even as
companies invest eight- and nine-figure sums to derive
insight from information streaming in from suppliers and
customers, less than 40% of employees have sufficiently
mature processes and skills to do so. To overcome this
insight deficit, “big data”, no matter how comprehensive
or well analyzed, needs to be complemented by “big
judgment”, according to an article in the Harvard
Business Review.[109]

Much in the same line, it has been pointed out that
decisions based on the analysis of big data are inevitably
“informed by the world as it was in the past, or, at best, as
it currently is”.[56] Fed by a large amount of data on past
experiences, algorithms can predict future development
if the future is similar to the past. If the system’s
dynamics change in the future, the past can say little
about the future. For this, it would be necessary to have a
thorough understanding of the system’s dynamics, which
implies theory.[110] As a response to this critique it has
been suggested to combine big data approaches with computer
simulations, such as agent-based models[56] and Complex
Systems.[111] Agent-based models are getting increasingly
better at predicting the outcome of social complexities
of even unknown future scenarios through computer
simulations that are based on a collection of mutually
interdependent algorithms.[112][113] In addition, the use of
multivariate methods that probe for the latent structure of
the data, such as factor analysis and cluster analysis, has
proven useful as an analytic approach that goes well beyond
the bi-variate approaches (cross-tabs) typically employed
with smaller data sets.

In health and biology, conventional scientific approaches
are based on experimentation. For these approaches, the
limiting factor is the relevant data that can confirm or
refute the initial hypothesis.[114] A new postulate is now
accepted in biosciences: the information provided by
data in huge volumes (omics) without prior hypothesis
is complementary and sometimes necessary to conventional
approaches based on experimentation. In the
massive approaches it is the formulation of a relevant
hypothesis to explain the data that is the limiting factor.
The search logic is reversed and the limits of induction
(“Glory of Science and Philosophy scandal”, C. D. Broad,
1926) are to be considered.

Privacy advocates are concerned about the threat to
privacy represented by increasing storage and integration of
personally identifiable information; expert panels have
released various policy recommendations to conform
practice to expectations of privacy.[115][116][117]


10.7.2 Critiques of big data execution 

Big data has been called a “fad” in scientific research
and its use was even made fun of as an absurd practice
in a satirical example on “pig data”.[93] Researcher
danah boyd has raised concerns about the use of big
data in science neglecting principles such as choosing a
representative sample by being too concerned about
actually handling the huge amounts of data.[118] This
approach may lead to results that are biased in one way or
another. Integration across heterogeneous data resources
(some that might be considered “big data” and others not)
presents formidable logistical as well as analytical
challenges, but many researchers argue that such
integrations are likely to represent the most promising new
frontiers in science.[119] In the provocative article
“Critical Questions for Big Data”,[120] the authors call big
data a part of mythology: “large data sets offer a higher
form of intelligence and knowledge [...], with the aura of
truth, objectivity, and accuracy”. Users of big data are
often “lost in the sheer volume of numbers”, and “working
with Big Data is still subjective, and what it quantifies
does not necessarily have a closer claim on objective
truth”.[120] Recent developments in the BI domain, such as
pro-active reporting, especially target improvements in the
usability of Big Data through automated filtering of
non-useful data and correlations.[121]

Big data analysis is often shallow compared to analysis of
smaller data sets.[122] In many big data projects, there is
no large data analysis happening, but the challenge is the
extract, transform, load part of data preprocessing.[122]

Big data is a buzzword and a “vague term”,[123] but at the
same time an “obsession”[123] with entrepreneurs, consultants,
scientists and the media. Big data showcases such
as Google Flu Trends failed to deliver good predictions
in recent years, overstating the flu outbreaks by a factor
of two. Similarly, Academy Awards and election predictions
based solely on Twitter were more often off than on
target. Big data often poses the same challenges as small
data; adding more data does not solve problems of
bias, but may emphasize other problems. In particular,
data sources such as Twitter are not representative of the
overall population, and results drawn from such sources
may then lead to wrong conclusions. Google Translate,
which is based on big data statistical analysis of text, does
a remarkably good job at translating web pages. However,
results from specialized domains may be dramatically
skewed. On the other hand, big data may also introduce
new problems, such as the multiple comparisons
problem: simultaneously testing a large set of hypotheses
is likely to produce many false results that mistakenly
appear to be significant. Ioannidis argued that “most
published research findings are false”[124] due to essentially
the same effect: when many scientific teams and researchers
each perform many experiments (i.e. process a
large amount of scientific data, although not with big data
technology), the likelihood of a “significant” result actually
being false grows fast, even more so when only positive
results are published.
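
The multiple comparisons problem described above can be illustrated with a small simulation. The sketch below assumes that every tested hypothesis is in fact a true null, so that each p-value is uniformly distributed on [0, 1]; the function name count_false_positives and the Bonferroni correction used for comparison are choices made for this illustration and do not come from the text.

```python
import random


def count_false_positives(num_tests, alpha, seed=0):
    """Simulate num_tests true-null hypotheses and count "significant" ones.

    Under a true null hypothesis the p-value is uniform on [0, 1], so each
    simulated p-value is simply a uniform random number.
    """
    rng = random.Random(seed)
    p_values = [rng.random() for _ in range(num_tests)]

    # Naive testing: every p-value below alpha is declared significant.
    naive = sum(p < alpha for p in p_values)

    # Bonferroni correction (an assumption of this sketch): compare each
    # p-value against alpha / num_tests instead of alpha.
    corrected = sum(p < alpha / num_tests for p in p_values)
    return naive, corrected


if __name__ == "__main__":
    naive, corrected = count_false_positives(num_tests=1000, alpha=0.05)
    print("false positives without correction:", naive)
    print("false positives with Bonferroni correction:", corrected)
```

With 1000 tests at a significance level of 0.05, roughly 50 results are expected to look significant even though none of them is real.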

10.8 See also 

• Apache Accumulo 

• Apache Hadoop 

• Big Data to Knowledge 

• Data Defined Storage 

• Data mining 

• Cask (company) 

• Cloudera 

• HPCC Systems 

• Intelligent Maintenance Systems 

• Internet of Things 

• MapReduce 

• Hortonworks 

• Oracle NoSQL Database 

• Nonlinear system identification 


• Operations research 

• Programming with Big Data in R (a series of R packages)

• Sqrrl 

• Supercomputer 

• Talend 

• Transreality gaming 

• Tuple space 

• Unstructured data

Chapter 11

Euclidean distance 


In mathematics, the Euclidean distance or Euclidean
metric is the “ordinary” (i.e. straight-line) distance between
two points in Euclidean space. With this distance,
Euclidean space becomes a metric space. The associated
norm is called the Euclidean norm. Older literature
refers to the metric as the Pythagorean metric. A generalized
term for the Euclidean norm is the L² norm or L²
distance.


11.1 Definition 

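The Euclidean distance between points p and q is the
length of the line segment connecting them ($\overline{\mathbf{p}\mathbf{q}}$).

In Cartesian coordinates, if $\mathbf{p} = (p_1, p_2, \ldots, p_n)$ and
$\mathbf{q} = (q_1, q_2, \ldots, q_n)$ are two points in Euclidean n-space, then
the distance (d) from p to q, or from q to p, is given by
the Pythagorean formula:

$$d(\mathbf{p},\mathbf{q}) = d(\mathbf{q},\mathbf{p}) = \sqrt{(q_1-p_1)^2 + (q_2-p_2)^2 + \cdots + (q_n-p_n)^2} = \sqrt{\sum_{i=1}^{n} (q_i-p_i)^2}. \qquad (1)$$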


The position of a point in a Euclidean n-space is a
Euclidean vector. So, p and q are Euclidean vectors,
starting from the origin of the space, and their tips indicate
two points. The Euclidean norm, or Euclidean
length, or magnitude of a vector measures the length of
the vector:

$$\|\mathbf{p}\| = \sqrt{p_1^2 + p_2^2 + \cdots + p_n^2} = \sqrt{\mathbf{p}\cdot\mathbf{p}},$$

where the last equation involves the dot product.

A vector can be described as a directed line segment from
the origin of the Euclidean space (vector tail), to a point
in that space (vector tip). If we consider that its length
is actually the distance from its tail to its tip, it becomes
clear that the Euclidean norm of a vector is just a special
case of Euclidean distance: the Euclidean distance
between its tail and its tip.

The distance between points p and q may have a direction
(e.g. from p to q), so it may be represented by another
vector, given by

$$\mathbf{q}-\mathbf{p} = (q_1-p_1,\ q_2-p_2,\ \ldots,\ q_n-p_n).$$

In a three-dimensional space (n = 3), this is an arrow from
p to q, which can also be regarded as the position of q
relative to p. It may also be called a displacement vector
if p and q represent two positions of the same point at
two successive instants of time.

The Euclidean distance between p and q is just the Euclidean
length of this distance (or displacement) vector:

$$\|\mathbf{q}-\mathbf{p}\| = \sqrt{(\mathbf{q}-\mathbf{p})\cdot(\mathbf{q}-\mathbf{p})}, \qquad (2)$$

which is equivalent to equation 1, and also to:

$$\|\mathbf{q}-\mathbf{p}\| = \sqrt{\|\mathbf{p}\|^2 + \|\mathbf{q}\|^2 - 2\,\mathbf{p}\cdot\mathbf{q}}.$$
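
The three equivalent expressions for the distance given above can be compared numerically. The following is a minimal Python sketch; the helper names dot, norm and euclidean_distance are chosen for this illustration, and points are assumed to be plain tuples of numbers of equal length.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    """Euclidean norm: square root of the dot product of v with itself."""
    return math.sqrt(dot(v, v))


def euclidean_distance(p, q):
    """Distance from the coordinate formula (equation 1)."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


if __name__ == "__main__":
    p, q = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
    displacement = tuple(qi - pi for pi, qi in zip(p, q))
    # Coordinate formula.
    print(euclidean_distance(p, q))
    # Length of the displacement vector q - p.
    print(norm(displacement))
    # Expansion in terms of the norms of p and q and their dot product.
    print(math.sqrt(norm(p) ** 2 + norm(q) ** 2 - 2 * dot(p, q)))
```

All three printed values agree up to floating-point rounding.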

11.1.1 One dimension 

In one dimension, the distance between two points on the
real line is the absolute value of their numerical difference.
Thus if x and y are two points on the real line, then
the distance between them is given by:

$$\sqrt{(x-y)^2} = |x-y|.$$

In one dimension, there is a single homogeneous, 
translation-invariant metric (in other words, a distance 
that is induced by a norm), up to a scale factor of length, 
which is the Euclidean distance. In higher dimensions 
there are other possible norms. 

11.1.2 Two dimensions 

In the Euclidean plane, if $\mathbf{p} = (p_1, p_2)$ and $\mathbf{q} = (q_1, q_2)$
then the distance is given by

$$d(\mathbf{p},\mathbf{q}) = \sqrt{(q_1-p_1)^2 + (q_2-p_2)^2}.$$

This is equivalent to the Pythagorean theorem.

Alternatively, it follows from (2) that if the polar coordinates
of the point p are $(r_1, \theta_1)$ and those of q are
$(r_2, \theta_2)$, then the distance between the points is

$$\sqrt{r_1^2 + r_2^2 - 2 r_1 r_2 \cos(\theta_1 - \theta_2)}.$$
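
As a quick check of the polar form, the sketch below computes the distance both from polar coordinates and from the corresponding Cartesian coordinates. The function names are illustrative, and angles are assumed to be in radians.

```python
import math


def polar_distance(r1, theta1, r2, theta2):
    """Distance between two points given in polar coordinates."""
    squared = r1 * r1 + r2 * r2 - 2.0 * r1 * r2 * math.cos(theta1 - theta2)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, squared))


def polar_to_cartesian(r, theta):
    """Convert polar coordinates (r, theta) to Cartesian (x, y)."""
    return (r * math.cos(theta), r * math.sin(theta))


if __name__ == "__main__":
    p = (2.0, math.pi / 6)
    q = (3.0, math.pi / 2)
    (x1, y1), (x2, y2) = polar_to_cartesian(*p), polar_to_cartesian(*q)
    print(polar_distance(*p, *q))
    print(math.hypot(x2 - x1, y2 - y1))
```

The two printed values coincide up to rounding, since the polar formula is the law of cosines applied to the triangle formed by the origin and the two points.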

11.1.3 Three dimensions 

In three-dimensional Euclidean space, the distance is

$$d(\mathbf{p},\mathbf{q}) = \sqrt{(p_1-q_1)^2 + (p_2-q_2)^2 + (p_3-q_3)^2}.$$

11.1.4 n dimensions

In general, for an n-dimensional space, the distance is

$$d(\mathbf{p},\mathbf{q}) = \sqrt{(p_1-q_1)^2 + (p_2-q_2)^2 + \cdots + (p_i-q_i)^2 + \cdots + (p_n-q_n)^2}.$$

11.1.5 Squared Euclidean distance

The standard Euclidean distance can be squared in order
to place progressively greater weight on objects that are
farther apart. In this case, the equation becomes

$$d^2(\mathbf{p},\mathbf{q}) = (p_1-q_1)^2 + (p_2-q_2)^2 + \cdots + (p_i-q_i)^2 + \cdots + (p_n-q_n)^2.$$

Squared Euclidean distance is not a metric, as it does not
satisfy the triangle inequality; however, it is frequently
used in optimization problems in which distances only
have to be compared. For example, on the real line the
points 0, 1 and 2 have squared distances 1 (from 0 to 1),
1 (from 1 to 2) and 4 (from 0 to 2), and 4 > 1 + 1.

It is also referred to as quadrance within the field of
rational trigonometry.


11.2 See also

• Chebyshev distance measures distance assuming
only the most significant dimension is relevant.

• Euclidean distance matrix

• Hamming distance identifies the difference bit by bit
of two strings

• Mahalanobis distance normalizes based on a covariance
matrix to make the distance metric scale-invariant.

• Manhattan distance measures distance following
only axis-aligned directions.

• Metric

• Minkowski distance is a generalization that unifies
Euclidean distance, Manhattan distance, and
Chebyshev distance.

• Pythagorean addition

11.3 References

• Deza, Elena; Deza, Michel Marie (2009). Encyclopedia
of Distances. Springer. p. 94.

• “Cluster analysis”. March 2, 2011.





Chapter 12 

Hamming distance 


In information theory, the Hamming distance between
two strings of equal length is the number of positions at
which the corresponding symbols are different. Put another
way, it measures the minimum number of substitutions
required to change one string into the other, or the
minimum number of errors that could have transformed
one string into the other.

A major application is in coding theory, more specifically
to block codes, in which the equal-length strings are
vectors over a finite field.


12.1 Examples 

The Hamming distance between: 

• "karolin" and "kathrin" is 3. 

• "karolin" and "kerstin" is 3. 

• 1011101 and 1001001 is 2.

• 2173896 and 2233796 is 3. 

On a two-dimensional grid such as a chessboard, the 
Hamming distance is the minimum number of moves it 
would take a rook to move from one cell to the other. 


12.2 Properties 

For a fixed length n, the Hamming distance is a metric
on the vector space of the words of length n (also known
as a Hamming space), as it fulfills the conditions of
non-negativity, identity of indiscernibles and symmetry, and
it can be shown by complete induction that it satisfies the
triangle inequality as well.[1] The Hamming distance between
two words a and b can also be seen as the Hamming
weight of a−b for an appropriate choice of the − operator.

For binary strings a and b the Hamming distance is equal
to the number of ones (population count) in a XOR b.
The metric space of length-n binary strings, with the
Hamming distance, is known as the Hamming cube; it is
equivalent as a metric space to the set of distances between
vertices in a hypercube graph. One can also view a
binary string of length n as a vector in $\mathbb{R}^n$ by treating each
symbol in the string as a real coordinate; with this embedding,
the strings form the vertices of an n-dimensional
hypercube, and the Hamming distance of the strings is
equivalent to the Manhattan distance between the vertices.
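
The two views of binary strings described above can be checked directly. In this minimal sketch the binary strings are assumed to be stored as non-negative Python integers; the function names are chosen for this illustration.

```python
def hamming_distance_bits(a, b):
    """Hamming distance of two non-negative integers viewed as bit strings.

    The XOR has a one exactly where the inputs differ, so counting its
    ones (the population count) gives the distance.
    """
    return bin(a ^ b).count("1")


def to_bit_vector(value, length):
    """Return the bits of value as a list of 0/1 coordinates, high bit first."""
    return [(value >> i) & 1 for i in reversed(range(length))]


def manhattan_distance(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(u, v))


if __name__ == "__main__":
    a, b = 0b1011101, 0b1001001
    print(hamming_distance_bits(a, b))
    print(manhattan_distance(to_bit_vector(a, 7), to_bit_vector(b, 7)))
```

Both calls give the same value for the pair 1011101 and 1001001 from the examples section, because each differing bit contributes exactly one unit along one axis of the hypercube.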

12.3 Error detection and error correction

The Hamming distance is used to define some essential
notions in coding theory, such as error detecting and error
correcting codes. In particular, a code C is said to be
k-errors detecting if any two codewords c1 and c2 from C
that have a Hamming distance of at most k coincide; in other
words, a code is k-errors detecting if and only if
the minimum Hamming distance between any two of its
codewords is at least k+1.[1]

A code C is said to be k-errors correcting if for every
word w in the underlying Hamming space H there exists
at most one codeword c (from C) such that the Hamming
distance between w and c is at most k. In other words,
a code is k-errors correcting if and only if the minimum
Hamming distance between any two of its codewords is
at least 2k+1. This is more easily understood geometrically
as any closed balls of radius k centered on distinct
codewords being disjoint.[1] These balls are also called
Hamming spheres in this context.

Thus a code with minimum Hamming distance d between
its codewords can detect at most d−1 errors and can correct
⌊(d−1)/2⌋ errors.[1] The latter number is also called
the packing radius or the error-correcting capability of the
code.[2]
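
The relationship between minimum distance and error handling can be sketched in a few lines of Python. The code is assumed to be given as a list of at least two distinct equal-length strings; the function names are illustrative.

```python
from itertools import combinations


def hamming_distance(s1, s2):
    """Number of positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


def minimum_distance(code):
    """Smallest Hamming distance between any two distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(code, 2))


def detectable_errors(d):
    """A code with minimum distance d detects up to d - 1 errors."""
    return d - 1


def correctable_errors(d):
    """A code with minimum distance d corrects up to floor((d - 1) / 2) errors."""
    return (d - 1) // 2


if __name__ == "__main__":
    repetition_code = ["000", "111"]
    d = minimum_distance(repetition_code)
    print(d, detectable_errors(d), correctable_errors(d))
```

For the three-fold repetition code the minimum distance is 3, so two flipped bits can always be detected and a single flipped bit can be corrected by majority vote.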


12.4 History and applications 

The Hamming distance is named after Richard Hamming,
who introduced it in his fundamental paper on
Hamming codes, Error detecting and error correcting
codes, in 1950.[3] Hamming weight analysis of bits is
used in several disciplines including information theory,
coding theory, and cryptography.

It is used in telecommunication to count the number of
flipped bits in a fixed-length binary word as an estimate
of error, and therefore is sometimes called the signal distance.
For q-ary strings over an alphabet of size q ≥ 2
the Hamming distance is applied in case of the q-ary
symmetric channel, while the Lee distance is used for
phase-shift keying or more generally channels susceptible
to synchronization errors because the Lee distance accounts
for errors of ±1.[4] If q = 2 or q = 3 both distances
coincide because Z/2Z and Z/3Z are also fields, but Z/4Z
is not a field but only a ring.
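
The comparison between the two distances can be made concrete. The sketch below assumes the usual definition of the Lee distance, in which each position contributes the shorter way around the cycle Z/qZ; symbols are assumed to be integers in the range 0 to q−1, and the function names are illustrative.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))


def lee_distance(x, y, q):
    """Lee distance over Z/qZ: per position, the shorter way around the cycle."""
    return sum(min((a - b) % q, (b - a) % q) for a, b in zip(x, y))


def distances_coincide(q):
    """Check whether Lee and Hamming distance agree on all symbol pairs."""
    return all(
        lee_distance([a], [b], q) == hamming_distance([a], [b])
        for a in range(q)
        for b in range(q)
    )


if __name__ == "__main__":
    for q in (2, 3, 4):
        print(q, distances_coincide(q))
```

For q = 2 and q = 3 any two distinct symbols differ by ±1, so the two distances agree; for q = 4 the symbols 0 and 2 have Lee distance 2 but Hamming distance 1.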

The Hamming distance is also used in systematics as a 
measure of genetic distance. 

However, for comparing strings of different lengths, or
strings where not just substitutions but also insertions or
deletions have to be expected, a more sophisticated metric
like the Levenshtein distance is more appropriate.


12.5 Algorithm example 

The Python function hamming_distance() computes the
Hamming distance between two strings (or other sequences)
of equal length, by creating a sequence of
Boolean values indicating mismatches and matches between
corresponding positions in the two inputs, and then
summing the sequence with False and True values being
interpreted as zero and one.

    def hamming_distance(s1, s2):
        """Return the Hamming distance between equal-length sequences."""
        if len(s1) != len(s2):
            raise ValueError("Undefined for sequences of unequal length")
        # Each mismatch contributes True (1), each match False (0).
        return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

The following C function will compute the Hamming distance
of two integers (considered as binary values, that
is, as sequences of bits). The running time of this procedure
is proportional to the Hamming distance rather
than to the number of bits in the inputs. It computes the
bitwise exclusive or of the two inputs, and then finds the
Hamming weight of the result (the number of nonzero
bits) using an algorithm of Wegner (1960) that repeatedly
finds and clears the lowest-order nonzero bit.

    int hamming_distance(unsigned x, unsigned y)
    {
        int dist = 0;
        unsigned val = x ^ y;  /* bits set where x and y differ */

        /* Count the number of bits set */
        while (val != 0)
        {
            /* A bit is set, so increment the count and clear the bit */
            dist++;
            val &= val - 1;
        }

        /* Return the number of differing bits */
        return dist;
    }


12.6 See also 

• Closest string 

• Damerau-Levenshtein distance 

• Euclidean distance 

• Mahalanobis distance 

• Jaccard index 

• String metric 

• Sørensen similarity index

• Word ladder 


12.7 Notes

[2] Cohen, G.; Honkala, I.; Litsyn, S.; Lobstein, A. (1997),
Covering Codes, North-Holland Mathematical Library 54,
Elsevier, pp. 16-17, ISBN 9780080530079

[3] Hamming (1950).

[4] Ron Roth (2006). Introduction to Coding Theory. Cam¬

Chapter 13


Norm (mathematics) 


This article is about linear algebra and analysis. For 
field theory, see Field norm. For ideals, see Ideal norm. 
For group theory, see Norm (group). For norms in 
descriptive set theory, see prewellordering. 

In linear algebra, functional analysis and related areas of 
mathematics, a norm is a function that assigns a strictly 
positive length or size to each vector in a vector space — 
save for the zero vector, which is assigned a length of zero. 
A seminorm, on the other hand, is allowed to assign zero 
length to some non-zero vectors (in addition to the zero 
vector). 

A norm must also satisfy certain properties pertaining to 
scalability and additivity which are given in the formal 
definition below. 

A simple example is the 2-dimensional Euclidean space
R² equipped with the Euclidean norm. Elements in this
vector space (e.g., (3, 7)) are usually drawn as arrows in a
2-dimensional Cartesian coordinate system starting at the
origin (0, 0). The Euclidean norm assigns to each vector
the length of its arrow. Because of this, the Euclidean
norm is often known as the magnitude.

A vector space on which a norm is defined is called a
normed vector space. Similarly, a vector space with a
seminorm is called a seminormed vector space. It is often
possible to supply a norm for a given vector space in
more than one way.

13.1 Definition 

Given a vector space V over a subfield F of the complex
numbers, a norm on V is a function p: V → R with the
following properties:[1]

For all a ∈ F and all u, v ∈ V,

1. p(av) = |a| p(v) (absolute homogeneity or absolute
scalability).

2. p(u + v) ≤ p(u) + p(v) (triangle inequality or
subadditivity).

3. If p(v) = 0 then v is the zero vector (separates points).


By the first axiom, absolute homogeneity, we have p(0) =
0 and p(−v) = p(v), so that by the triangle inequality
0 = p(v − v) ≤ p(v) + p(−v) = 2p(v), which gives

p(v) ≥ 0 (positivity).

A seminorm on V is a function p: V → R with the properties
1. and 2. above.

Every vector space V with seminorm p induces a normed
space V/W, called the quotient space, where W is the subspace
of V consisting of all vectors v in V with p(v) = 0.
The induced norm on V/W is clearly well-defined and is
given by:

p(W + v) = p(v).

Two norms (or seminorms) p and q on a vector space V
are equivalent if there exist two real constants c and C,
with c > 0, such that

for every vector v in V, one has that: c q(v) ≤ p(v) ≤ C q(v).

A topological vector space is called normable
(seminormable) if the topology of the space can
be induced by a norm (seminorm).

13.2 Notation 

If a norm p: V → R is given on a vector space V then the
norm of a vector v ∈ V is usually denoted by enclosing it
within double vertical lines: ‖v‖ = p(v). Such notation is
also sometimes used if p is only a seminorm.

For the length of a vector in Euclidean space (which is an
example of a norm, as explained below), the notation |v|
with single vertical lines is also widespread.

In Unicode, the codepoint of the “double vertical line”
character ‖ is U+2016. The double vertical line should
not be confused with the “parallel to” symbol, Unicode
U+2225 ( ∥ ). This is usually not a problem because the
former is used in parenthesis-like fashion, whereas the latter
is used as an infix operator. The double vertical line
used here should also not be confused with the symbol
used to denote lateral clicks, Unicode U+01C1 ( ǁ ). The
single vertical line | is called “vertical line” in Unicode
and its codepoint is U+007C.

13.3 Examples 

• All norms are seminorms. 

• The trivial seminorm has p(x) = 0 for all x in V. 

• Every linear form f on a vector space defines a seminorm
by x → |f(x)|.
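
The last example can be made concrete with a small sketch. The linear form f(x1, x2) = x1 − x2 used below is a hypothetical choice for illustration only; any other linear form would serve equally well.

```python
def linear_form(x):
    """A hypothetical linear form on R^2: f(x1, x2) = x1 - x2."""
    return x[0] - x[1]


def seminorm(x):
    """Seminorm induced by the linear form: p(x) = |f(x)|."""
    return abs(linear_form(x))


if __name__ == "__main__":
    u, v = (3.0, 1.0), (-1.0, 2.0)
    u_plus_v = (u[0] + v[0], u[1] + v[1])
    # Subadditivity: p(u + v) <= p(u) + p(v).
    print(seminorm(u_plus_v), "<=", seminorm(u) + seminorm(v))
    # Absolute homogeneity: p(a u) = |a| p(u).
    print(seminorm((-2.0 * u[0], -2.0 * u[1])), "==", 2.0 * seminorm(u))
    # Not a norm: a non-zero vector can have seminorm zero.
    print(seminorm((1.0, 1.0)))
```

The vector (1, 1) is non-zero but is assigned length zero, so this seminorm does not separate points and is therefore not a norm.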

13.3.1 Absolute-value norm 

The absolute value

$$\|x\| = |x|$$

is a norm on the one-dimensional vector spaces formed
by the real or complex numbers.

13.3.2 Euclidean norm 

Main article: Euclidean distance 

On an n-dimensional Euclidean space $\mathbb{R}^n$, the intuitive
notion of length of the vector $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ is captured
by the formula

$$\|\mathbf{x}\| := \sqrt{x_1^2 + \cdots + x_n^2}.$$

This gives the ordinary distance from the origin to the
point x, a consequence of the Pythagorean theorem. The
Euclidean norm is by far the most commonly used norm
on $\mathbb{R}^n$, but there are other norms on this vector space as
will be shown below. However all these norms are equivalent
in the sense that they all define the same topology.

On an n-dimensional complex space $\mathbb{C}^n$ the most common
norm is

$$\|\mathbf{z}\| := \sqrt{|z_1|^2 + \cdots + |z_n|^2} = \sqrt{z_1 \bar{z}_1 + \cdots + z_n \bar{z}_n}.$$

In both cases we can also express the norm as the square
root of the inner product of the vector and itself:

$$\|\mathbf{x}\| := \sqrt{\mathbf{x}^{*}\,\mathbf{x}},$$

where x is represented as a column vector ($[x_1; x_2; \ldots; x_n]$),
and $\mathbf{x}^{*}$ denotes its conjugate transpose.

This formula is valid for any inner product space, including
Euclidean and complex spaces. For Euclidean spaces,
the inner product is equivalent to the dot product. Hence,
in this specific case the formula can also be written with
the following notation:

$$\|\mathbf{x}\| := \sqrt{\mathbf{x}\cdot\mathbf{x}}.$$
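
The inner-product form of the norm works for real and complex vectors alike. The following sketch uses Python's built-in complex numbers; the function name euclidean_norm is illustrative, and the vector is assumed to be a list of int, float or complex entries.

```python
import math


def euclidean_norm(z):
    """Square root of the inner product of z with itself.

    Each term z_i * conj(z_i) is a non-negative real number, so only the
    real part is kept before taking the square root.
    """
    return math.sqrt(sum((zi * zi.conjugate()).real for zi in z))


if __name__ == "__main__":
    print(euclidean_norm([3, 4]))            # real vector
    print(euclidean_norm([3 + 4j]))          # one complex coordinate
    print(euclidean_norm([1 + 1j, 1 - 1j]))  # two complex coordinates
```

For a single complex coordinate the result agrees with the modulus returned by abs().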

The Euclidean norm is also called the Euclidean length,
L² distance, ℓ² distance, L² norm, or ℓ² norm; see Lp
space.

The set of vectors in $\mathbb{R}^{n+1}$ whose Euclidean norm is a
given positive constant forms an n-sphere.

Euclidean norm of a complex number

The Euclidean norm of a complex number is the absolute
value (also called the modulus) of it, if the complex plane
is identified with the Euclidean plane R². This identification
of the complex number x + iy as a vector in the
Euclidean plane makes the quantity $\sqrt{x^2 + y^2}$ (as first
suggested by Euler) the Euclidean norm associated with
the complex number.

13.3.3 Taxicab norm or Manhattan norm 

Main article: Taxicab geometry

$$\|\mathbf{x}\|_1 := \sum_{i=1}^{n} |x_i|.$$

The name relates to the distance a taxi has to drive in a
rectangular street grid to get from the origin to the point
x.

The set of vectors whose 1-norm is a given constant forms
the surface of a cross polytope of dimension equivalent
to that of the norm minus 1. The Taxicab norm is also
called the L1 norm. The distance derived from this norm
is called the Manhattan distance or L1 distance.

The 1-norm is simply the sum of the absolute values of
the columns.

In contrast,

$$\sum_{i=1}^{n} x_i$$

is not a norm because it may yield negative results.

13.3.4 p-norm

Main article: Lp space

Let p ≥ 1 be a real number.

$$\|\mathbf{x}\|_p := \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}.$$

Note that for p = 1 we get the taxicab norm, for p = 2
we get the Euclidean norm, and as p approaches ∞ the
p-norm approaches the infinity norm or maximum norm.
Note that the p-norm is related to the Hölder mean.

This definition is still of some interest for 0 < p < 1, but
the resulting function does not define a norm,[2] because
it violates the triangle inequality. What is true for this
case of 0 < p < 1, even in the measurable analog, is that
the corresponding $L^p$ class is a vector space, and it is also
true that the function

$$\int_X |f(x) - g(x)|^p \, d\mu$$

(without pth root) defines a distance that makes $L^p(X)$ into
a complete metric topological vector space. These spaces
are of great interest in functional analysis, probability
theory, and harmonic analysis. However, outside trivial
cases, this topological vector space is not locally convex
and has no continuous nonzero linear forms. Thus the
topological dual space contains only the zero functional.

The derivative of the p-norm is given by

$$\frac{\partial}{\partial x_k} \|\mathbf{x}\|_p = \frac{x_k \, |x_k|^{p-2}}{\|\mathbf{x}\|_p^{\,p-1}}.$$

For the special case of p = 2, this becomes

$$\frac{\partial}{\partial x_k} \|\mathbf{x}\|_2 = \frac{x_k}{\|\mathbf{x}\|_2},$$

or

$$\frac{\partial}{\partial \mathbf{x}} \|\mathbf{x}\|_2 = \frac{\mathbf{x}}{\|\mathbf{x}\|_2}.$$

13.3.5 Maximum norm (special case of:
infinity norm, uniform norm, or
supremum norm)

Main article: Maximum norm

$$\|\mathbf{x}\|_\infty := \max\left(|x_1|, \ldots, |x_n|\right).$$

The set of vectors whose infinity norm is a given constant,
c, forms the surface of a hypercube with edge length 2c.


13.3.6 Zero norm

In probability and functional analysis, the zero norm induces
a complete metric topology for the space of measurable
functions and for the F-space of sequences with
F-norm $(x_n) \mapsto \sum_n 2^{-n} x_n / (1 + x_n)$, which is discussed
by Stefan Rolewicz in Metric Linear Spaces.[3]

Hamming distance of a vector from zero

See also: Hamming distance and discrete metric

In metric geometry, the discrete metric takes the value
one for distinct points and zero otherwise. When applied
coordinate-wise to the elements of a vector space, the
discrete distance defines the Hamming distance, which
is important in coding and information theory. In the
field of real or complex numbers, the distance of the discrete
metric from zero is not homogeneous in the non-zero
point; indeed, the distance from zero remains one
as its non-zero argument approaches zero. However, the
discrete distance of a number from zero does satisfy the
other properties of a norm, namely the triangle inequality
and positive definiteness. When applied component-wise
to vectors, the discrete distance from zero behaves like a
non-homogeneous “norm”, which counts the number of
non-zero components in its vector argument; again, this
non-homogeneous “norm” is discontinuous.

In signal processing and statistics, David Donoho referred
to the zero “norm” with quotation marks. Following
Donoho’s notation, the zero “norm” of x is simply the
number of non-zero coordinates of x, or the Hamming
distance of the vector from zero. When this “norm” is localized
to a bounded set, it is the limit of p-norms as p approaches
0. Of course, the zero “norm” is not a B-norm,
because it is not positive homogeneous. It is not even an
F-norm, because it is discontinuous, jointly and severally,
with respect to the scalar argument in scalar-vector
multiplication and with respect to its vector argument.
Abusing terminology, some engineers omit Donoho’s
quotation marks and inappropriately call the number-of-nonzeros
function the L0 norm (sic.), also misusing the
notation for the Lebesgue space of measurable functions.
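
The p-norm, the maximum norm and the zero “norm” of the preceding sections can be compared on small vectors. The sketch below is illustrative only; the function names are assumptions, and vectors are assumed to be lists of real numbers.

```python
def p_norm(x, p):
    """(sum |x_i|^p)^(1/p); a norm only when p >= 1."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


def max_norm(x):
    """Maximum (infinity) norm: the largest absolute coordinate."""
    return max(abs(xi) for xi in x)


def zero_norm(x):
    """Zero "norm": the number of non-zero coordinates (not a true norm)."""
    return sum(1 for xi in x if xi != 0)


if __name__ == "__main__":
    x = [3, 4]
    # p = 1 (taxicab), p = 2 (Euclidean), and a large p close to the maximum norm.
    print(p_norm(x, 1), p_norm(x, 2), p_norm(x, 100), max_norm(x))

    # For 0 < p < 1 the triangle inequality fails.
    u, v = [1, 0], [0, 1]
    print(p_norm([1, 1], 0.5), ">", p_norm(u, 0.5) + p_norm(v, 0.5))

    # The zero "norm" ignores scaling, so it is not homogeneous.
    print(zero_norm([0, 2, 0, 5]), zero_norm([0, 0.002, 0, 0.005]))
```

The second print shows the failure of the triangle inequality for p = 1/2, and the third shows that scaling a vector leaves its zero “norm” unchanged.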


13.3.7 Other norms 

Other norms on $\mathbb{R}^n$ can be constructed by combining the
above; for example

$$\|\mathbf{x}\| := 2|x_1| + \sqrt{3|x_2|^2 + \max(|x_3|, 2|x_4|)^2}$$

is a norm on $\mathbb{R}^4$.

For any norm and any injective linear transformation A
we can define a new norm of x, equal to

$$\|A\mathbf{x}\|.$$

In 2D, with A a rotation by 45° and a suitable scaling,
this changes the taxicab norm into the maximum norm.
In 2D, each A applied to the taxicab norm, up to inversion
and interchanging of axes, gives a different unit ball:
a parallelogram of a particular shape, size and orientation.
In 3D this is similar but different for the 1-norm
(octahedrons) and the maximum norm (prisms with parallelogram
base).
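
The 2D statement about rotating the taxicab norm can be verified numerically. In the sketch below the matrix A is a rotation by 45° scaled by 1/√2, which is one suitable scaling; its entries are written out as ±0.5, and the helper names are illustrative.

```python
def taxicab_norm(x):
    """1-norm: sum of absolute coordinates."""
    return sum(abs(xi) for xi in x)


def max_norm(x):
    """Maximum norm: largest absolute coordinate."""
    return max(abs(xi) for xi in x)


def apply(matrix, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in matrix]


# Rotation by 45 degrees, [[c, -s], [s, c]] with c = s = 1/sqrt(2),
# multiplied by the scaling factor 1/sqrt(2).
A = [[0.5, -0.5],
     [0.5, 0.5]]

if __name__ == "__main__":
    for x in ([3.0, -1.0], [-2.0, 5.0], [1.0, 1.0]):
        print(x, taxicab_norm(apply(A, x)), max_norm(x))
```

For every vector the taxicab norm of Ax equals |x1 − x2|/2 + |x1 + x2|/2, which is the larger of |x1| and |x2|.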

All the above formulas also yield norms on $\mathbb{C}^n$ without
modification.


13.3.8 Infinite-dimensional case 

The generalization of the above norms to an infinite number
of components leads to the Lp spaces, with norms

$$\|x\|_p = \left( \sum_{i \in \mathbb{N}} |x_i|^p \right)^{1/p} \quad \text{resp.} \quad \|f\|_{p,X} = \left( \int_X |f(x)|^p \, dx \right)^{1/p}$$

(for complex-valued sequences x resp. functions f defined
on $X \subset \mathbb{R}$), which can be further generalized (see
Haar measure).

Any inner product induces in a natural way the norm

$$\|x\| := \sqrt{\langle x, x \rangle}.$$

Other examples of infinite dimensional normed vector
spaces can be found in the Banach space article.


13.4 Properties 

The concept of unit circle (the set of all vectors of norm
1) is different in different norms: for the 1-norm the unit
circle in R² is a square, for the 2-norm (Euclidean norm)
it is the well-known unit circle, while for the infinity norm
it is a different square. For any p-norm it is a superellipse
(with congruent axes). See the accompanying illustration.
Note that due to the definition of the norm, the unit circle
is always convex and centrally symmetric (therefore, for
example, the unit ball may be a rectangle but cannot be a
triangle).

In terms of the vector space, the seminorm defines a
topology on the space, and this is a Hausdorff topology
precisely when the seminorm can distinguish between
distinct vectors, which is again equivalent to the seminorm
being a norm. The topology thus defined (by either
a norm or a seminorm) can be understood either in terms
of sequences or open sets. A sequence of vectors $\{v_n\}$ is
said to converge in norm to v if $\|v_n - v\| \to 0$ as $n \to \infty$.
Equivalently, the topology consists of all sets that can
be represented as a union of open balls.

Two norms $\|\cdot\|_\alpha$ and $\|\cdot\|_\beta$ on a vector space V are called
equivalent if there exist positive real numbers C and D
such that for all x in V

$$C\|x\|_\alpha \le \|x\|_\beta \le D\|x\|_\alpha.$$

For instance, on $\mathbb{C}^n$, if p > r > 0, then

$$\|x\|_p \le \|x\|_r \le n^{(1/r - 1/p)} \|x\|_p.$$

In particular,

$$\|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2$$
$$\|x\|_\infty \le \|x\|_2 \le \sqrt{n}\,\|x\|_\infty$$
$$\|x\|_\infty \le \|x\|_1 \le n\,\|x\|_\infty.$$
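
The three pairs of bounds can be spot-checked on random vectors. The sketch below is not a proof; it only tests the inequalities on real vectors, with a small tolerance (the parameter tol, an assumption of this sketch) to absorb floating-point rounding.

```python
import math
import random


def norm_1(x):
    return sum(abs(xi) for xi in x)


def norm_2(x):
    return math.sqrt(sum(xi * xi for xi in x))


def norm_inf(x):
    return max(abs(xi) for xi in x)


def equivalence_bounds_hold(x, tol=1e-9):
    """Check the three pairs of bounds between the 1-, 2- and infinity norms."""
    n = len(x)
    n1, n2, ninf = norm_1(x), norm_2(x), norm_inf(x)
    slack = tol * max(1.0, n1)  # tolerance for rounding errors
    return (
        n2 <= n1 + slack and n1 <= math.sqrt(n) * n2 + slack
        and ninf <= n2 + slack and n2 <= math.sqrt(n) * ninf + slack
        and ninf <= n1 + slack and n1 <= n * ninf + slack
    )


if __name__ == "__main__":
    rng = random.Random(0)
    vectors = [[rng.uniform(-10, 10) for _ in range(5)] for _ in range(1000)]
    print(all(equivalence_bounds_hold(x) for x in vectors))
```

The upper bounds are attained by vectors whose coordinates all have the same absolute value, such as (1, 1, 1, 1).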

If the vector space is a finite-dimensional real or complex 
one, all norms are equivalent. On the other hand, in the 
case of infinite-dimensional vector spaces, not all norms 
are equivalent. 

Equivalent norms define the same notions of continuity
and convergence and for many purposes do not need to
be distinguished. To be more precise the uniform structure
defined by equivalent norms on the vector space is
uniformly isomorphic.

Every (semi)-norm is a sublinear function, which implies 
that every norm is a convex function. As a result, finding 
a global optimum of a norm-based objective function is 
often tractable. 

Given a finite family of seminorms $p_i$ on a vector space
the sum

$$p(x) := \sum_{i=0}^{n} p_i(x)$$

is again a seminorm.

For any norm p on a vector space V, we have that for all
u and v ∈ V:

p(u ± v) ≥ |p(u) − p(v)|.

Proof: Applying the triangular inequality to both p(u − 0)
and p(v − 0):

p(u − 0) ≤ p(u − v) + p(v − 0) ⇒ p(u − v) ≥ p(u) − p(v)
p(u − 0) ≤ p(u + v) + p(0 − v) ⇒ p(u + v) ≥ p(u) − p(v)
p(v − 0) ≤ p(u − v) + p(u − 0) ⇒ p(u − v) ≥ p(v) − p(u)
p(v − 0) ≤ p(u + v) + p(0 − u) ⇒ p(u + v) ≥ p(v) − p(u)

Thus, p(u ± v) ≥ |p(u) − p(v)|.

If X and Y are normed spaces and u: X → Y is a
continuous linear map, then the norm of u and the norm
of the transpose of u are equal.[4]

For the $l_p$ norms, we have Hölder’s inequality[5]

$$|x^{\mathsf{T}} y| \le \|x\|_p \, \|y\|_q, \qquad \frac{1}{p} + \frac{1}{q} = 1.$$

A special case of this is the Cauchy-Schwarz inequality:[5]

$$|x^{\mathsf{T}} y| \le \|x\|_2 \, \|y\|_2.$$
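
Both inequalities can be checked on concrete vectors. In the sketch below the conjugate exponent q is computed from p as q = p/(p − 1); the function names are illustrative, and the vectors are assumed to be real.

```python
def p_norm(x, p):
    """(sum |x_i|^p)^(1/p) for p >= 1."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


def holder_sides(x, y, p):
    """Return (|x . y|, ||x||_p * ||y||_q) with 1/p + 1/q = 1."""
    if p <= 1:
        raise ValueError("this sketch requires p > 1")
    q = p / (p - 1.0)
    lhs = abs(sum(a * b for a, b in zip(x, y)))
    rhs = p_norm(x, p) * p_norm(y, q)
    return lhs, rhs


if __name__ == "__main__":
    # General case with p = 3, hence q = 3/2.
    print(holder_sides([1.0, -2.0, 3.0], [4.0, 0.0, -1.0], 3.0))
    # Cauchy-Schwarz (p = q = 2) holds with equality for parallel vectors.
    print(holder_sides([1.0, 2.0], [2.0, 4.0], 2.0))
```

In the first case the left side is strictly smaller than the right; in the second the two sides agree up to rounding because y is a multiple of x.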

13.5 Classification of seminorms: absolutely convex absorbing sets

All seminorms on a vector space V can be classified in
terms of absolutely convex absorbing sets in V. To each
such set, A, corresponds a seminorm $p_A$ called the gauge
of A, defined as

$$p_A(x) := \inf\{\alpha : \alpha > 0,\ x \in \alpha A\}$$

with the property that

$$\{x : p_A(x) < 1\} \subseteq A \subseteq \{x : p_A(x) \le 1\}.$$
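
The gauge can be approximated numerically when the set A is available only through a membership test. The sketch below uses bisection on α; the oracle functions in_square and in_disk are hypothetical examples, and the approach assumes that A is absolutely convex and absorbing, so that x ∈ αA implies x ∈ βA for every β ≥ α.

```python
def gauge(x, contains, tol=1e-9):
    """Approximate inf{alpha > 0 : x in alpha * A} by bisection.

    contains(y) is a membership oracle for A; x lies in alpha * A
    exactly when x / alpha lies in A.
    """
    if all(xi == 0 for xi in x):
        return 0.0
    # Grow an upper bound until x / hi lies in A (possible since A is absorbing).
    hi = 1.0
    while not contains([xi / hi for xi in x]):
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if contains([xi / mid for xi in x]):
            hi = mid
        else:
            lo = mid
    return hi


def in_square(y):
    """Closed square [-1, 1]^2; its gauge is the maximum norm."""
    return all(abs(yi) <= 1.0 for yi in y)


def in_disk(y):
    """Closed unit disk; its gauge is the Euclidean norm."""
    return sum(yi * yi for yi in y) <= 1.0


if __name__ == "__main__":
    print(gauge([3.0, -1.0], in_square))
    print(gauge([3.0, 4.0], in_disk))
```

The two results approximate the maximum norm and the Euclidean norm of the respective vectors, illustrating how different sets A give rise to different norms.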


Conversely:

Any locally convex topological vector space has a local
basis consisting of absolutely convex sets. A common
method to construct such a basis is to use a family (p) of
seminorms p that separates points: the collection of all
finite intersections of sets {p < 1/n} turns the space into
a locally convex topological vector space so that every p
is continuous.

Such a method is used to design weak and weak* topologies.

norm case:

Suppose now that (p) contains a single p: since
(p) is separating, p is a norm, and A = {p < 1}
is its open unit ball. Then A is an absolutely
convex bounded neighbourhood of 0, and p =
$p_A$ is continuous.

The converse is due to Kolmogorov: any locally
convex and locally bounded topological vector
space is normable. Precisely:

If V is an absolutely convex bounded neighbourhood
of 0, the gauge $g_V$ (so that V = {$g_V$
< 1}) is a norm.

13.6 Generalizations 

There are several generalizations of norms and seminorms.
If p is absolutely homogeneous but in place of subadditivity
we require that

$$p(u + v) \le b\,\bigl(p(u) + p(v)\bigr) \quad \text{for some } b \ge 1,$$

then p is called a quasi-seminorm and the smallest value
of b for which this holds is called the multiplier of p;
if in addition p separates points then it is called a
quasi-norm.

On the other hand, if p satisfies the triangle inequality but
in place of absolute homogeneity we require that

$$p(\lambda v) = |\lambda|^k \, p(v) \quad \text{for some } 0 < k \le 1,$$

then p is called a k-seminorm.

We have the following relationship between quasi-seminorms
and k-seminorms:

Suppose that q is a quasi-seminorm on a vector
space X with multiplier b. If 0 < k < log 2 b
then there exists a k-seminorm p on X equivalent
to q.

13.7 See also 

• Normed vector space 

• Asymmetric norm 

• Matrix norm

• Gowers norm

• Mahalanobis distance 

• Manhattan distance 

• Relation of norms and metrics 


13.8 Notes

[2] Except in R¹, where it coincides with the Euclidean norm,
and R⁰, where it is trivial.

[4] Treves pp. 242-243

Chapter 14

Regularization (mathematics) 


For other uses in related fields, see Regularization 
(disambiguation) . 

Regularization, in mathematics and statistics and particularly
in the fields of machine learning and inverse problems,
refers to a process of introducing additional information
in order to solve an ill-posed problem or to
prevent overfitting. This information is usually of the
form of a penalty for complexity, such as restrictions for
smoothness or bounds on the vector space norm.

A theoretical justification for regularization is that it attempts
to impose Occam’s razor on the solution. From a
Bayesian point of view, many regularization techniques
correspond to imposing certain prior distributions on
model parameters.

The same idea arose in many fields of science. For
example, the least-squares method can be viewed as
a very simple form of regularization. A simple form
of regularization applied to integral equations, generally
termed Tikhonov regularization after Andrey Nikolayevich
Tikhonov, is essentially a trade-off between fitting
the data and reducing a norm of the solution. More
recently, non-linear regularization methods, including
total variation regularization, have become popular.

14.1 Regularization in statistics and machine learning

In statistics and machine learning, regularization methods
are used for model selection, in particular to prevent
overfitting by penalizing models with extreme parameter
values. The most common variants in machine learning
are L1 and L2 regularization, which can be added to
learning algorithms that minimize a loss function E(X,
Y) by instead minimizing E(X, Y) + α‖w‖, where w is
the model’s weight vector, ‖·‖ is either the L1 norm or the
squared L2 norm, and α is a free parameter that needs
to be tuned empirically (typically by cross-validation; see
hyperparameter optimization). This method applies to
many models. When applied in linear regression, the resulting
models are termed lasso (for L1) or ridge regression
(for L2), but regularization is also employed in (binary and
multiclass) logistic regression, neural nets, support vector
machines, conditional random fields and some matrix
decomposition methods. L2 regularization may also be called
“weight decay”, in particular in the setting of neural nets.
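
The effect of the penalty term can be seen in the simplest possible setting. The sketch below assumes a linear model with a single feature and no intercept, a squared-error loss, and the squared L2 penalty, so that the minimizer has a closed form; the function names and the toy data are illustrative only.

```python
def regularized_loss(w, xs, ys, alpha):
    """Squared-error loss plus the squared L2 penalty alpha * w^2."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) + alpha * w * w


def ridge_weight(xs, ys, alpha):
    """Minimizer of regularized_loss for one feature and no intercept.

    Setting the derivative with respect to w to zero gives
    w = sum(x * y) / (sum(x * x) + alpha).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
    for alpha in (0.0, 1.0, 14.0, 100.0):
        print(alpha, ridge_weight(xs, ys, alpha))
```

As α grows, the fitted weight shrinks from the unpenalized least-squares value towards zero, which is the sense in which the penalty discourages extreme parameter values.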

L1 regularization is often preferred because it produces
sparse models and thus performs feature selection within
the learning algorithm, but since the L1 norm is not
differentiable at zero, it may require changes to learning
algorithms, in particular gradient-based learners.[1][2]
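
One common way to handle the non-differentiable L1 penalty is the soft-thresholding operator; this technique is not named in the text above and is shown here only as an illustration. The sketch assumes a linear model with a single feature and no intercept, for which minimizing the squared error plus α|w| has a closed-form solution.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t, returning exactly zero when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_weight(xs, ys, alpha):
    """Minimizer of sum((y - w*x)^2) + alpha * |w| for one feature.

    For w > 0 the derivative 2*sxx*w - 2*sxy + alpha vanishes at
    w = (sxy - alpha/2) / sxx; the case w < 0 is symmetric, and
    otherwise the minimum is at w = 0.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, alpha / 2.0) / sxx


if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
    for alpha in (0.0, 28.0, 56.0, 100.0):
        print(alpha, lasso_weight(xs, ys, alpha))
```

Unlike the squared L2 penalty, which only shrinks a weight, a sufficiently large L1 penalty sets the weight to exactly zero, which is the source of the sparsity mentioned above.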

Bayesian learning methods make use of a prior proba¬ 
bility that (usually) gives lower probability to more com¬ 
plex models. Well-known model selection techniques in¬ 
clude the Akaike information criterion (AIC), minimum 
description length (MDL), and the Bayesian informa¬ 
tion criterion (BIC). Alternative methods of control¬ 
ling overfitting not involving regularization include cross- 
validation. 

Regularization can be used to fine-tune model complexity
using an augmented error function with cross-validation.
As model complexity increases, the validation error can
level off: the training-set error keeps decreasing while
the validation-set error remains constant. Regularization
introduces a second factor that weights the penalty against
more complex models with an increasing variance in the
data errors. This gives an increasing penalty as model
complexity increases.[3]

Examples of applications of different methods of regu¬ 
larization to the linear model are: 

A linear combination of the LASSO and ridge regression 
methods is elastic net regularization. 


14.2 See also 


• Bayesian interpretation of regularization 


• Regularization by spectral filtering 
Chapter 15

Loss function 


In mathematical optimization, statistics, decision theory 
and machine learning, a loss function or cost function 
is a function that maps an event or values of one or more 
variables onto a real number intuitively representing some 
“cost” associated with the event. An optimization prob¬ 
lem seeks to minimize a loss function. An objective 
function is either a loss function or its negative (some¬ 
times called a reward function, a profit function, a utility 
function, etc.), in which case it is to be maximized. 

In statistics, typically a loss function is used for parameter
estimation, and the event in question is some function of
the difference between estimated and true values for an
instance of data. The concept, as old as Laplace, was
reintroduced in statistics by Abraham Wald in the middle
of the 20th century.[1] In the context of economics,
for example, this is usually economic cost or regret. In
classification, it is the penalty for an incorrect classification
of an example. In actuarial science, it is used in
an insurance context to model benefits paid over premiums,
particularly since the works of Harald Cramér in
the 1920s.[2] In optimal control the loss is the penalty for
failing to achieve a desired value. In financial risk management
the function is precisely mapped to a monetary
loss.


15.1 Use in statistics 

Parameter estimation for supervised learning tasks such 
as regression or classification can be formulated as the 
minimization of a loss function over a training set. The 
goal of estimation is to find a function that models its input 
well: if it were applied to the training set, it should predict 
the values (or class labels) associated with the samples in 
that set. The loss function quantifies the amount by which 
the prediction deviates from the actual values. 

15.1.1 Definition 

Formally, we begin by considering some family of distributions
for a random variable $X$ that is indexed by some
$\theta$.

More intuitively, we can think of $X$ as our “data”, perhaps
$X = (X_1, \ldots, X_n)$, where $X_i \sim F_\theta$ i.i.d. The $X$ is the
set of things the decision rule will be making decisions
on. There exists some number of possible ways $F_\theta$ to
model our data $X$, which our decision function can use to
make decisions. For a finite number of models, we can
thus think of $\theta$ as the index to this family of probability
models. For an infinite family of models, it is a set of
parameters to the family of distributions.

On a more practical note, it is important to understand
that, while it is tempting to think of loss functions as
necessarily parametric (since they seem to take $\theta$ as a
“parameter”), the fact that $\theta$ is infinite-dimensional is
completely incompatible with this notion; for example, if the
family of probability functions is uncountably infinite, $\theta$
indexes an uncountably infinite space.

From here, given a set $A$ of possible actions, a decision
rule is a function $\delta : \mathcal{X} \to A$.

A loss function is a real lower-bounded function $L$ on
$\Theta \times A$ for some $\theta \in \Theta$. The value $L(\theta, \delta(X))$ is the cost of
action $\delta(X)$ under parameter $\theta$.[3]


15.2 Expected loss 

The value of the loss function itself is a random quantity 
because it depends on the outcome of a random variable 
X. Both frequentist and Bayesian statistical theory involve 
making a decision based on the expected value of the loss 
function: however this quantity is defined differently un¬ 
der the two paradigms. 


15.2.1 Frequentist expected loss 

We first define the expected loss in the frequentist context.
It is obtained by taking the expected value with respect to
the probability distribution, $P_\theta$, of the observed data, $X$.
This is also referred to as the risk function[4][5][6][7] of
the decision rule $\delta$ and the parameter $\theta$. Here the decision
rule depends on the outcome of $X$. The risk function is
given by[8]

$$R(\theta, \delta) = \operatorname{E}_\theta L(\theta, \delta(X)) = \int_X L(\theta, \delta(x)) \, \mathrm{d}P_\theta(x).$$

15.2.2 Bayesian expected loss 

In a Bayesian approach, the expectation is calculated using
the posterior distribution $\pi^*$ of the parameter $\theta$:

$$\rho(\pi^*, a) = \int_\Theta L(\theta, a) \, \mathrm{d}\pi^*(\theta).$$

One then should choose the action a* which minimises 
the expected loss. Although this will result in choosing 
the same action as would be chosen using the frequentist 
risk, the emphasis of the Bayesian approach is that one is 
only interested in choosing the optimal action under the 
actual observed data, whereas choosing the actual Bayes 
optimal decision rule, which is a function of all possible 
observations, is a much more difficult problem. 
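As a rough illustration of choosing the action $a^*$ that minimises the Bayesian expected loss, the sketch below replaces the integral over $\Theta$ with a sum over a small discrete posterior. The posterior values, the grid of candidate actions and the names `bayes_expected_loss` and `squared_loss` are assumptions made for the example only.

```python
def bayes_expected_loss(action, posterior, loss):
    """Expected loss of an action under a discrete posterior {theta: probability}."""
    return sum(prob * loss(theta, action) for theta, prob in posterior.items())

def squared_loss(theta, action):
    return (theta - action) ** 2

# Hypothetical posterior over three parameter values (probabilities sum to 1).
posterior = {0.0: 0.2, 1.0: 0.5, 2.0: 0.3}

# Candidate actions on a grid from 0.00 to 2.00.
actions = [k / 100 for k in range(201)]
best_action = min(actions, key=lambda a: bayes_expected_loss(a, posterior, squared_loss))

# Under squared loss the minimiser should coincide with the posterior mean.
posterior_mean = sum(theta * prob for theta, prob in posterior.items())
```

Only the observed-data posterior enters the calculation, which matches the point made in the text: the optimal action is found directly, without constructing a full decision rule over all possible observations.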

15.2.3 Economic choice under uncertainty 

In economics, decision-making under uncertainty is of¬ 
ten modelled using the von Neumann-Morgenstern util¬ 
ity function of the uncertain variable of interest, such as 
end-of-period wealth. Since the value of this variable is 
uncertain, so is the value of the utility function; it is the 
expected value of utility that is maximized. 

15.2.4 Examples 

• For a scalar parameter $\theta$, a decision function whose
output $\hat\theta$ is an estimate of $\theta$, and a quadratic loss
function

$$L(\theta, \hat\theta) = (\theta - \hat\theta)^2,$$

the risk function becomes the mean squared error
of the estimate,

$$R(\theta, \hat\theta) = \operatorname{E}_\theta (\theta - \hat\theta)^2.$$

• In density estimation, the unknown parameter is
probability density itself. The loss function is typically
chosen to be a norm in an appropriate function
space. For example, for $L^2$ norm,

$$L(f, \hat f) = \|f - \hat f\|_2^2,$$

the risk function becomes the mean integrated
squared error

$$R(f, \hat f) = \operatorname{E} \|f - \hat f\|^2.$$
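The first example can be checked numerically. The sketch below assumes a binomial observation $X \sim \mathrm{Binomial}(n, \theta)$ and the estimator $\hat\theta = X/n$ (both chosen here for illustration, not taken from the text), and evaluates the frequentist risk as an exact sum over all outcomes. For this estimator the mean squared error is known to equal $\theta(1-\theta)/n$.

```python
from math import comb

def risk(theta, n, estimator):
    """Exact risk under squared loss: sum over x of loss * P_theta(X = x)."""
    total = 0.0
    for x in range(n + 1):
        prob = comb(n, x) * theta ** x * (1 - theta) ** (n - x)
        total += (theta - estimator(x, n)) ** 2 * prob
    return total

def sample_proportion(x, n):
    return x / n

theta, n = 0.3, 10
mse = risk(theta, n, sample_proportion)
expected = theta * (1 - theta) / n   # closed-form MSE of the sample proportion
```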


15.3 Decision rules 

A decision rule makes a choice using an optimality crite¬ 
rion. Some commonly used criteria are: 

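• Minimax: Choose the decision rule with the lowest
worst loss, that is, minimize the worst-case (maximum
possible) loss:

$$\underset{\delta}{\operatorname{arg\,min}} \; \max_{\theta \in \Theta} R(\theta, \delta).$$

• Invariance: Choose the optimal decision rule which
satisfies an invariance requirement.

• Choose the decision rule with the lowest average loss
(i.e. minimize the expected value of the loss function):

$$\underset{\delta}{\operatorname{arg\,min}} \; \operatorname{E}_{\theta \in \Theta}[R(\theta, \delta)] = \underset{\delta}{\operatorname{arg\,min}} \int_{\theta \in \Theta} R(\theta, \delta) \, p(\theta) \, \mathrm{d}\theta.$$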
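The minimax and average-loss criteria can select different rules. The following sketch compares three estimators of a binomial proportion under squared loss; the candidate rules, the grid over $\theta$ and the uniform prior used for the average are all assumptions made for illustration. The worst case is taken over the grid, and the average over the grid stands in for the integral with $p(\theta) = 1$.

```python
from math import comb, sqrt

def risk(theta, n, rule):
    """Exact risk R(theta, rule) under squared loss for X ~ Binomial(n, theta)."""
    return sum(
        comb(n, x) * theta ** x * (1 - theta) ** (n - x) * (theta - rule(x)) ** 2
        for x in range(n + 1)
    )

n = 10
rules = {
    "proportion": lambda x: x / n,
    "laplace": lambda x: (x + 1) / (n + 2),
    "constant_risk": lambda x: (x + sqrt(n) / 2) / (n + sqrt(n)),
}

thetas = [k / 100 for k in range(101)]   # grid standing in for the parameter space

worst = {name: max(risk(t, n, rule) for t in thetas) for name, rule in rules.items()}
average = {
    name: sum(risk(t, n, rule) for t in thetas) / len(thetas)
    for name, rule in rules.items()
}

minimax_rule = min(worst, key=worst.get)      # lowest worst-case risk
bayes_rule = min(average, key=average.get)    # lowest average risk (uniform prior)
```

With these particular candidates the rule whose risk does not depend on $\theta$ has the smallest maximum, while the rule $(x+1)/(n+2)$ has the smallest average, so the two criteria disagree.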

15.4 Selecting a loss function 

Sound statistical practice requires selecting an estima¬ 
tor consistent with the actual acceptable variation ex¬ 
perienced in the context of a particular applied prob¬ 
lem. Thus, in the applied use of loss functions, selecting 
which statistical method to use to model an applied prob¬ 
lem depends on knowing the losses that will be experi¬ 
enced from being wrong under the problem’s particular 
circumstances. [9] 

A common example involves estimating “location”. Under
typical statistical assumptions, the mean or average
is the statistic for estimating location that minimizes the
expected loss experienced under the squared-error loss
function, while the median is the estimator that minimizes
expected loss experienced under the absolute-difference
loss function. Still different estimators would
be optimal under other, less common circumstances.
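A small empirical version of this statement can be sketched by minimizing the total loss over a sample instead of the expected loss. The sample (which deliberately contains one outlier) and the candidate grid are hypothetical.

```python
from statistics import mean, median

data = [1.0, 2.0, 3.0, 4.0, 100.0]   # hypothetical sample with one outlier

def total_squared_loss(c):
    return sum((x - c) ** 2 for x in data)

def total_absolute_loss(c):
    return sum(abs(x - c) for x in data)

# Candidate location estimates from 0.0 to 100.0 in steps of 0.1.
candidates = [k / 10 for k in range(1001)]

best_squared = min(candidates, key=total_squared_loss)
best_absolute = min(candidates, key=total_absolute_loss)
```

Here the squared-error minimizer lands on the sample mean, which the outlier pulls far from the bulk of the data, while the absolute-error minimizer lands on the sample median.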

In economics, when an agent is risk neutral, the objective 
function is simply expressed in monetary terms, such as 
profit, income, or end-of-period wealth. 
But for risk-averse (or risk-loving) agents, loss is measured
as the negative of a utility function, which represents
satisfaction and is usually interpreted in ordinal
terms rather than in cardinal (absolute) terms.

Other measures of cost are possible, for example 
mortality or morbidity in the field of public health or 
safety engineering. 

For most optimization algorithms, it is desirable to have 
a loss function that is globally continuous and differen¬ 
tiable. 

Two very commonly used loss functions are the squared
loss, $L(a) = a^2$, and the absolute loss, $L(a) = |a|$.
However, the absolute loss has the disadvantage that it
is not differentiable at $a = 0$. The squared loss has
the disadvantage that it has the tendency to be dominated
by outliers: when summing over a set of $a$'s (as
in $\sum_{i=1}^{n} L(a_i)$), the final sum tends to be the result of a
few particularly large $a$-values, rather than an expression
of the average $a$-value.

The choice of a loss function is not arbitrary. It is very
restrictive and sometimes the loss function may be characterized
by its desirable properties.[10] Among the choice
principles are, for example, the requirement of completeness
of the class of symmetric statistics in the case of i.i.d.
observations, the principle of complete information, and
some others.

15.5 Loss functions in Bayesian 
statistics 

One of the consequences of Bayesian inference is that 
in addition to experimental data, the loss function does 
not in itself wholly determine a decision. What is im¬ 
portant is the relationship between the loss function and 
the posterior probability. So it is possible to have two 
different loss functions which lead to the same decision 
when the prior probability distributions associated with 
each compensate for the details of each loss function. 

Combining the three elements of the prior probability, 
the data, and the loss function then allows decisions to 
be based on maximizing the subjective expected utility, a 
concept introduced by Leonard J. Savage. 


15.6 Regret 

Main article: Regret (decision theory) 

Savage also argued that using non-Bayesian methods such
as minimax, the loss function should be based on the idea
of regret, i.e., the loss associated with a decision should
be the difference between the consequences of the best
decision that could have been taken had the underlying
circumstances been known and the decision that was in
fact taken before they were known.


15.7 Quadratic loss function 

The use of a quadratic loss function is common, for example
when using least squares techniques. It is often
more mathematically tractable than other loss functions
because of the properties of variances, as well as being
symmetric: an error above the target causes the same loss
as the same magnitude of error below the target. If the
target is $t$, then a quadratic loss function is

$$\lambda(x) = C(t - x)^2$$

for some constant $C$; the value of the constant makes no
difference to a decision, and can be ignored by setting it
equal to 1.

Many common statistics, including t-tests, regression
models, design of experiments, and much else, use least
squares methods applied using linear regression theory,
which is based on the quadratic loss function.

The quadratic loss function is also used in linear- 
quadratic optimal control problems. In these problems, 
even in the absence of uncertainty, it may not be possi¬ 
ble to achieve the desired values of all target variables. 
Often loss is expressed as a quadratic form in the devia¬ 
tions of the variables of interest from their desired values; 
this approach is tractable because it results in linear first- 
order conditions. In the context of stochastic control, the 
expected value of the quadratic form is used. 


15.8 0-1 loss function 

In statistics and decision theory, a frequently used loss
function is the 0-1 loss function

$$L(\hat{y}, y) = I(\hat{y} \neq y),$$

where $I$ is the indicator notation.
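A minimal sketch of the 0-1 loss applied to a classifier's output follows; the labels are made up for the example. Averaging the loss over a set of examples gives the misclassification rate.

```python
def zero_one_loss(predicted, actual):
    """Indicator of a wrong prediction: 1 if the labels differ, else 0."""
    return 1 if predicted != actual else 0

# Hypothetical predictions and true labels.
predicted_labels = ["spam", "ham", "spam", "ham"]
true_labels = ["spam", "spam", "spam", "ham"]

losses = [zero_one_loss(p, t) for p, t in zip(predicted_labels, true_labels)]
error_rate = sum(losses) / len(losses)   # fraction of misclassified examples
```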


15.9 See also 

• Discounted maximum loss 

• Hinge loss 

• Scoring rule 
Chapter 16

Least squares 


The method of least squares is a standard approach 
in regression analysis to the approximate solution of 
overdetermined systems, i.e., sets of equations in which 
there are more equations than unknowns. “Least squares” 
means that the overall solution minimizes the sum of the 
squares of the errors made in the results of every single 
equation. 

The most important application is in data fitting. The 
best fit in the least-squares sense minimizes the sum of 
squared residuals, a residual being the difference between 
an observed value and the fitted value provided by a 
model. When the problem has substantial uncertainties 
in the independent variable (the x variable), then simple 
regression and least squares methods have problems; in 
such cases, the methodology required for fitting errors- 
in-variables models may be considered instead of that for 
least squares. 

Least squares problems fall into two categories: linear or 
ordinary least squares and non-linear least squares, de¬ 
pending on whether or not the residuals are linear in all 
unknowns. The linear least-squares problem occurs in 
statistical regression analysis; it has a closed-form solu¬ 
tion. The non-linear problem is usually solved by iterative 
refinement; at each iteration the system is approximated 
by a linear one, and thus the core calculation is similar in 
both cases. 

Polynomial least squares describes the variance in a pre¬ 
diction of the dependent variable as a function of the 
independent variable and the deviations from the fitted 
curve. 

When the observations come from an exponential family
and mild conditions are satisfied, least-squares estimates
and maximum-likelihood estimates are identical.[1] The
method of least squares can also be derived as a method
of moments estimator.

The following discussion is mostly presented in terms of 
linear functions but the use of least-squares is valid and 
practical for more general families of functions. Also, 
by iteratively applying local quadratic approximation to 
the likelihood (through the Fisher information), the least- 
squares method may be used to fit a generalized linear 
model. 


For the topic of approximating a function by a sum of 
others using an objective function based on squared dis¬ 
tances, see least squares (function approximation). 
The result of fitting a set of data points with a quadratic function.

The least-squares method is usually credited to Carl
Friedrich Gauss (1795),[2] but it was first published by
Adrien-Marie Legendre.[3]


16.1 History 

16.1.1 Context 

The method of least squares grew out of the fields of 
astronomy and geodesy as scientists and mathematicians 
sought to provide solutions to the challenges of navigating 
the Earth’s oceans during the Age of Exploration. The ac¬ 
curate description of the behavior of celestial bodies was 
the key to enabling ships to sail in open seas, where sailors 
could no longer rely on land sightings for navigation. 
Conic fitting a set of points using least-squares approximation.

The method was the culmination of several advances that
took place during the course of the eighteenth century:[4]

• The combination of different observations as being 
the best estimate of the true value; errors decrease 
with aggregation rather than increase, perhaps first 
expressed by Roger Cotes in 1722. 

• The combination of different observations taken un¬ 
der the same conditions contrary to simply trying 
one’s best to observe and record a single observa¬ 
tion accurately. The approach was known as the 
method of averages. This approach was notably 
used by Tobias Mayer while studying the librations 
of the moon in 1750, and by Pierre-Simon Laplace 
in his work in explaining the differences in motion 
of Jupiter and Saturn in 1788. 

• The combination of different observations taken un¬ 
der different conditions. The method came to be 
known as the method of least absolute deviation. It 
was notably performed by Roger Joseph Boscovich 
in his work on the shape of the earth in 1757 and 
by Pierre-Simon Laplace for the same problem in 
1799. 

• The development of a criterion that can be evaluated 
to determine when the solution with the minimum 
error has been achieved. Laplace tried to specify 
a mathematical form of the probability density for 
the errors and define a method of estimation that 
minimizes the error of estimation. For this pur¬ 
pose, Laplace used a symmetric two-sided exponen¬ 
tial distribution we now call Laplace distribution to 
model the error distribution, and used the sum of ab¬ 
solute deviation as error of estimation. He felt these 
to be the simplest assumptions he could make, and 
he had hoped to obtain the arithmetic mean as the 
best estimate. Instead, his estimator was the poste¬ 
rior median. 



Carl Friedrich Gauss 


16.1.2 The method 

The first clear and concise exposition of the method of
least squares was published by Legendre in 1805.[5] The
technique is described as an algebraic procedure for fit¬ 
ting linear equations to data and Legendre demonstrates 
the new method by analyzing the same data as Laplace for 
the shape of the earth. The value of Legendre’s method 
of least squares was immediately recognized by leading 
astronomers and geodesists of the time. 

In 1809 Carl Friedrich Gauss published his method of 
calculating the orbits of celestial bodies. In that work 
he claimed to have been in possession of the method of 
least squares since 1795. This naturally led to a priority 
dispute with Legendre. However, to Gauss’s credit, he 
went beyond Legendre and succeeded in connecting the 
method of least squares with the principles of probabil¬ 
ity and to the normal distribution. He had managed to 
complete Laplace’s program of specifying a mathemati¬ 
cal form of the probability density for the observations, 
depending on a finite number of unknown parameters, 
and define a method of estimation that minimizes the er¬ 
ror of estimation. Gauss showed that arithmetic mean 
is indeed the best estimate of the location parameter by 
changing both the probability density and the method of 
estimation. He then turned the problem around by ask¬ 
ing what form the density should have and what method 
of estimation should be used to get the arithmetic mean 
as estimate of the location parameter. In this attempt, he 
invented the normal distribution. 

An early demonstration of the strength of Gauss’s method
came when it was used to predict the future location
of the newly discovered asteroid Ceres. On 1 January
1801, the Italian astronomer Giuseppe Piazzi discovered
Ceres and was able to track its path for 40 days before
it was lost in the glare of the sun. Based on this data,
astronomers desired to determine the location of Ceres
after it emerged from behind the sun without solving
Kepler’s complicated nonlinear equations of planetary
motion. The only predictions that successfully allowed
Hungarian astronomer Franz Xaver von Zach to
relocate Ceres were those performed by the 24-year-old
Gauss using least-squares analysis.

In 1810, after reading Gauss’s work and proving the
central limit theorem, Laplace used the theorem to give a
large-sample justification for the method of least squares
and the normal distribution. In 1822, Gauss was able to
state that the least-squares approach to regression analysis
is optimal in the sense that in a linear model where the
errors have a mean of zero, are uncorrelated, and have
equal variances, the best linear unbiased estimator of the
coefficients is the least-squares estimator. This result is
known as the Gauss-Markov theorem.

The idea of least-squares analysis was also independently
formulated by the American Robert Adrain in 1808. In
the next two centuries workers in the theory of errors and
in statistics found many different ways of implementing
least squares.[6]

16.2 Problem statement 

The objective consists of adjusting the parameters of a
model function to best fit a data set. A simple data set
consists of $n$ points (data pairs) $(x_i, y_i)$, $i = 1, \ldots, n$, where
$x_i$ is an independent variable and $y_i$ is a dependent variable
whose value is found by observation. The model function
has the form $f(x, \beta)$, where $m$ adjustable parameters are
held in the vector $\beta$. The goal is to find the parameter
values for the model which “best” fits the data. The least
squares method finds its optimum when the sum, $S$, of
squared residuals

$$S = \sum_{i=1}^{n} r_i^2$$

is a minimum. A residual is defined as the difference
between the actual value of the dependent variable and the
value predicted by the model:

$$r_i = y_i - f(x_i, \beta).$$

An example of a model is that of the straight line in two
dimensions. Denoting the intercept as $\beta_0$ and the slope as
$\beta_1$, the model function is given by $f(x, \beta) = \beta_0 + \beta_1 x$.
See linear least squares for a fully worked out example
of this model.

A data point may consist of more than one independent
variable. For example, when fitting a plane to a set of
height measurements, the plane is a function of two
independent variables, $x$ and $z$, say. In the most general case
there may be one or more independent variables and one
or more dependent variables at each data point.

16.3 Limitations

This regression formulation considers only residuals in
the dependent variable. There are two rather different
contexts in which different implications apply:

• Regression for prediction. Here a model is fitted to
provide a prediction rule for application in a similar
situation to which the data used for fitting apply.
Here the dependent variables corresponding to
such future application would be subject to the same
types of observation error as those in the data used
for fitting. It is therefore logically consistent to use
the least-squares prediction rule for such data.

• Regression for fitting a “true relationship”. In standard
regression analysis, that leads to fitting by least
squares, there is an implicit assumption that errors
in the independent variable are zero or strictly
controlled so as to be negligible. When errors in
the independent variable are non-negligible, models
of measurement error can be used; such methods
can lead to parameter estimates, hypothesis testing
and confidence intervals that take into account the
presence of observation errors in the independent
variables.[7] An alternative approach is to fit a model
by total least squares; this can be viewed as taking a
pragmatic approach to balancing the effects of the
different sources of error in formulating an objective
function for use in model-fitting.

16.4 Solving the least squares problem

The minimum of the sum of squares is found by setting
the gradient to zero. Since the model contains $m$ parameters,
there are $m$ gradient equations:

$$\frac{\partial S}{\partial \beta_j} = 2 \sum_i r_i \frac{\partial r_i}{\partial \beta_j} = 0, \quad j = 1, \ldots, m,$$

and since $r_i = y_i - f(x_i, \beta)$, the gradient equations
become

$$-2 \sum_i r_i \frac{\partial f(x_i, \beta)}{\partial \beta_j} = 0, \quad j = 1, \ldots, m.$$

The gradient equations apply to all least squares problems.
Each particular problem requires particular expressions
for the model and its partial derivatives.
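For the straight-line model, the gradient equations can be checked numerically. The sketch below uses the textbook closed-form slope and intercept for a line fit and then evaluates both gradient components at that solution; the data points and the helper names `fit_line` and `gradient` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for f(x, beta) = beta0 + beta1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

def gradient(xs, ys, beta0, beta1):
    """Gradient of S = sum(r_i^2), i.e. -2 * sum(r_i * df/dbeta_j)."""
    residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
    d_beta0 = -2 * sum(residuals)                               # df/dbeta0 = 1
    d_beta1 = -2 * sum(r * x for r, x in zip(residuals, xs))    # df/dbeta1 = x
    return d_beta0, d_beta1

# Hypothetical observations lying close to a straight line.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

beta0, beta1 = fit_line(xs, ys)
g0, g1 = gradient(xs, ys, beta0, beta1)   # both should vanish up to rounding
```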
16.4.1 Linear least squares

Main article: Linear least squares

A regression model is a linear one when the model comprises
a linear combination of the parameters, i.e.,

$$f(x, \beta) = \sum_{j=1}^{m} \beta_j \phi_j(x),$$

where the function $\phi_j$ is a function of $x$.

Letting

$$X_{ij} = \frac{\partial f(x_i, \beta)}{\partial \beta_j} = \phi_j(x_i),$$

we can then see that in that case the least square estimate
(or estimator, in the context of a random sample), $\hat\beta$, is
given by

$$\hat\beta = (X^{\mathsf T} X)^{-1} X^{\mathsf T} y.$$

For a derivation of this estimate see Linear least squares
(mathematics).


16.4.2 Non-linear least squares

Main article: Non-linear least squares

There is no closed-form solution to a non-linear least
squares problem. Instead, numerical algorithms are used
to find the value of the parameters $\beta$ that minimizes the
objective. Most algorithms involve choosing initial values
for the parameters. Then, the parameters are refined
iteratively, that is, the values are obtained by successive
approximation:

$$\beta_j^{k+1} = \beta_j^{k} + \Delta\beta_j,$$

where $k$ is an iteration number, and the vector of increments
$\Delta\beta_j$ is called the shift vector. In some commonly
used algorithms, at each iteration the model may be linearized
by approximation to a first-order Taylor series expansion
about $\beta^k$:

$$f(x_i, \beta) = f^k(x_i, \beta) + \sum_j \frac{\partial f(x_i, \beta)}{\partial \beta_j} \left(\beta_j - \beta_j^{k}\right) = f^k(x_i, \beta) + \sum_j J_{ij} \, \Delta\beta_j.$$

The Jacobian $J$ is a function of constants, the independent
variable and the parameters, so it changes from one
iteration to the next. The residuals are given by

$$r_i = y_i - f^k(x_i, \beta) - \sum_{k=1}^{m} J_{ik} \, \Delta\beta_k = \Delta y_i - \sum_{j=1}^{m} J_{ij} \, \Delta\beta_j.$$

To minimize the sum of squares of $r_i$, the gradient equation
is set to zero and solved for $\Delta\beta_j$:

$$-2 \sum_{i=1}^{n} J_{ij} \left( \Delta y_i - \sum_{k=1}^{m} J_{ik} \, \Delta\beta_k \right) = 0,$$

which, on rearrangement, become $m$ simultaneous linear
equations, the normal equations:

$$\sum_{i=1}^{n} \sum_{k=1}^{m} J_{ij} J_{ik} \, \Delta\beta_k = \sum_{i=1}^{n} J_{ij} \, \Delta y_i \quad (j = 1, \ldots, m).$$

The normal equations are written in matrix notation as

$$(J^{\mathsf T} J) \, \Delta\beta = J^{\mathsf T} \, \Delta y.$$

These are the defining equations of the Gauss-Newton
algorithm.


16.4.3 Differences between linear and non¬ 
linear least squares 

• The model function, $f$, in LLSQ (linear least
squares) is a linear combination of parameters of
the form $f = X_{i1}\beta_1 + X_{i2}\beta_2 + \cdots$ The model
may represent a straight line, a parabola or any other
linear combination of functions. In NLLSQ (non-linear
least squares) the parameters appear as functions,
such as $\beta^2$, $e^{\beta x}$ and so forth. If the derivatives
$\partial f / \partial \beta_j$ are either constant or depend only on
the values of the independent variable, the model
is linear in the parameters. Otherwise the model is
non-linear.

• Algorithms for finding the solution to a NLLSQ
problem require initial values for the parameters;
LLSQ does not.

• Like LLSQ, solution algorithms for NLLSQ often 
require that the Jacobian be calculated. Analytical 
expressions for the partial derivatives can be compli¬ 
cated. If analytical expressions are impossible to ob¬ 
tain either the partial derivatives must be calculated 
by numerical approximation or an estimate must be 
made of the Jacobian. 

• In NLLSQ non-convergence (failure of the algorithm
to find a minimum) is a common phenomenon,
whereas the LLSQ objective is globally convex, so
non-convergence is not an issue.

• NLLSQ is usually an iterative process. The iterative
process has to be terminated when a convergence
criterion is satisfied. LLSQ solutions can be computed
using direct methods, although problems with
large numbers of parameters are typically solved
with iterative methods, such as the Gauss-Seidel
method.

• In LLSQ the solution is unique, but in NLLSQ there 
may be multiple minima in the sum of squares. 

• Under the condition that the errors are uncorrelated 
with the predictor variables, LLSQ yields unbiased 
estimates, but even under that condition NLLSQ es¬ 
timates are generally biased. 

These differences must be considered whenever the solu¬ 
tion to a non-linear least squares problem is being sought. 
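Several of these points (an initial value, an iteratively recomputed Jacobian, a convergence criterion) show up even in a one-parameter problem. The sketch below applies the Gauss-Newton update, in which the shift solves $(J^{\mathsf T} J)\,\Delta\beta = J^{\mathsf T}\,\Delta y$, to the model $f(x, b) = e^{bx}$. The model, the noise-free data, the starting value and the tolerance are all assumptions chosen for illustration; with a single parameter the normal equations reduce to a scalar division.

```python
import math

def gauss_newton(xs, ys, b, max_iter=50, tol=1e-12):
    """Fit f(x, b) = exp(b * x) by Gauss-Newton, starting from the initial value b."""
    for _ in range(max_iter):
        predictions = [math.exp(b * x) for x in xs]
        delta_y = [y - f for y, f in zip(ys, predictions)]     # current residuals
        jacobian = [x * f for x, f in zip(xs, predictions)]    # df/db = x * exp(b*x)
        jtj = sum(j * j for j in jacobian)                      # J^T J (a scalar here)
        jty = sum(j * d for j, d in zip(jacobian, delta_y))     # J^T delta_y
        shift = jty / jtj                                       # solve the normal equation
        b += shift
        if abs(shift) < tol:                                    # convergence criterion
            break
    return b

# Hypothetical noise-free data generated with b = 0.5.
xs = [0.0, 1.0, 2.0]
ys = [math.exp(0.5 * x) for x in xs]

b_hat = gauss_newton(xs, ys, b=0.3)
```

A poor starting value can make the same loop diverge or stall, which is the non-convergence risk noted in the list; a linear model would need neither the starting value nor the loop.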

16.5 Least squares, regression 
analysis and statistics 

The method of least squares is often used to generate es¬ 
timators and other statistics in regression analysis. 

Consider a simple example drawn from physics. A spring
should obey Hooke’s law which states that the extension
of a spring $y$ is proportional to the force, $F$, applied to it.

$$y = f(F, k) = kF$$

constitutes the model, where $F$ is the independent variable.
To estimate the force constant, $k$, a series of $n$ measurements
with different forces will produce a set of data,
$(F_i, y_i)$, $i = 1, \ldots, n$, where $y_i$ is a measured spring
extension. Each experimental observation will contain
some error. If we denote this error $\varepsilon$, we may specify
an empirical model for our observations,

$$y_i = kF_i + \varepsilon_i.$$

There are many methods we might use to estimate the
unknown parameter $k$. Noting that the $n$ equations in our
data comprise an overdetermined system with one unknown
and $n$ equations, we may choose to estimate $k$ using
least squares. The sum of squares to be minimized is

$$S = \sum_{i=1}^{n} (y_i - kF_i)^2.$$

The least squares estimate of the force constant, $k$, is
given by

$$\hat{k} = \frac{\sum_i F_i y_i}{\sum_i F_i^2}.$$

Here it is assumed that application of the force causes the
spring to expand and, having derived the force constant by
least squares fitting, the extension can be predicted from
Hooke’s law.
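The estimate of the force constant is simple enough to compute directly. In the sketch below the force and extension values are invented for the example; the code evaluates $\sum_i F_i y_i / \sum_i F_i^2$ and confirms that nudging $k$ in either direction increases the sum of squares.

```python
# Hypothetical measurements: applied forces and observed spring extensions.
forces = [1.0, 2.0, 3.0, 4.0]
extensions = [0.52, 0.98, 1.49, 2.05]

def sum_of_squares(k):
    """S(k) = sum over observations of (y_i - k * F_i)^2."""
    return sum((y - k * f) ** 2 for f, y in zip(forces, extensions))

# Least squares estimate of the force constant.
k_hat = sum(f * y for f, y in zip(forces, extensions)) / sum(f * f for f in forces)

# Predicted extension for a new force, using the fitted constant.
predicted_extension = k_hat * 5.0
```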

In regression analysis the researcher specifies an empirical
model. For example, a very common model is the
straight line model which is used to test if there is a
linear relationship between dependent and independent
variable. If a linear relationship is found to exist, the
variables are said to be correlated. However, correlation
does not prove causation, as both variables may be correlated
with other, hidden, variables, or the dependent variable
may “reverse” cause the independent variables, or
the variables may be otherwise spuriously correlated. For
example, suppose there is a correlation between deaths by
drowning and the volume of ice cream sales at a particular
beach. Yet, both the number of people going swimming
and the volume of ice cream sales increase as the
weather gets hotter, and presumably the number of deaths
by drowning is correlated with the number of people going
swimming. Perhaps an increase in swimmers causes
both the other variables to increase.

In order to make statistical tests on the results it is neces¬ 
sary to make assumptions about the nature of the exper¬ 
imental errors. A common (but not necessary) assump¬ 
tion is that the errors belong to a normal distribution. The 
central limit theorem supports the idea that this is a good 
approximation in many cases. 

• The Gauss-Markov theorem. In a linear model in 
which the errors have expectation zero conditional 
on the independent variables, are uncorrelated and 
have equal variances, the best linear unbiased esti¬ 
mator of any linear combination of the observations, 
is its least-squares estimator. “Best” means that the 
least squares estimators of the parameters have min¬ 
imum variance. The assumption of equal variance 
is valid when the errors all belong to the same dis¬ 
tribution. 

• In a linear model, if the errors belong to a normal 
distribution the least squares estimators are also the 
maximum likelihood estimators. 

However, if the errors are not normally distributed, a 
central limit theorem often nonetheless implies that the 
parameter estimates will be approximately normally dis¬ 
tributed so long as the sample is reasonably large. For this 
reason, given the important property that the error mean 
is independent of the independent variables, the distribu¬ 
tion of the error term is not an important issue in regres¬ 
sion analysis. Specifically, it is not typically important 
whether the error term follows a normal distribution. 
In a least squares calculation with unit weights, or in linear
regression, the variance on the $j$th parameter, denoted
$\operatorname{var}(\hat\beta_j)$, is usually estimated with

$$\operatorname{var}(\hat\beta_j) = \sigma^2 \left( [X^{\mathsf T} X]^{-1} \right)_{jj} \approx \frac{S}{n - m} \left( [X^{\mathsf T} X]^{-1} \right)_{jj},$$

where the true residual variance $\sigma^2$ is replaced by an estimate
based on the minimised value of the sum of squares
objective function $S$. The denominator, $n - m$, is the
statistical degrees of freedom; see effective degrees of
freedom for generalizations.

Confidence limits can be found if the probability distribution
of the parameters is known, or an asymptotic approximation
is made, or assumed. Likewise statistical tests on
the residuals can be made if the probability distribution
of the residuals is known or assumed. The probability
distribution of any linear combination of the dependent
variables can be derived if the probability distribution of
experimental errors is known or assumed. Inference is
particularly straightforward if the errors are assumed to
follow a normal distribution, which implies that the parameter
estimates and residuals will also be normally distributed
conditional on the values of the independent variables.


16.6 Weighted least squares

See also: Weighted mean and Linear least squares
(mathematics) § Weighted linear least squares

For a weighted sum of squared residuals,
$S = \sum_i W_{ii} r_i^2$, the gradient equations are

$$-2 \sum_i W_{ii} \frac{\partial f(x_i, \beta)}{\partial \beta_j} r_i = 0, \quad j = 1, \ldots, m,$$

which, in a linear least squares system give the modified
normal equations,

$$\sum_{i=1}^{n} \sum_{k=1}^{m} X_{ij} W_{ii} X_{ik} \hat\beta_k = \sum_{i=1}^{n} X_{ij} W_{ii} y_i, \quad j = 1, \ldots, m.$$

When the observational errors are uncorrelated and the
weight matrix, $W$, is diagonal, these may be written as

$$(X^{\mathsf T} W X) \hat\beta = X^{\mathsf T} W y.$$

If the errors are correlated, the resulting estimator is the
BLUE if the weight matrix is equal to the inverse of the
variance-covariance matrix of the observations.

When the errors are uncorrelated, it is convenient to
simplify the calculations to factor the weight matrix as
$w_{ii} = \sqrt{W_{ii}}$. The normal equations can then be written
as

$$(X'^{\mathsf T} X') \hat\beta = X'^{\mathsf T} y'$$

where

$$X' = wX, \quad y' = wy.$$

A special case of generalized least squares called 
weighted least squares occurs when all the off-diagonal 
entries of 12 (the correlation matrix of the residu¬ 
als) are null; the variances of the observations (along 
the covariance matrix diagonal) may still be unequal 
(heteroskedasticity). 

The expressions given above are based on the implicit as¬ 
sumption that the errors are uncorrelated with each other 
and with the independent variables and have equal vari¬ 
ance. The Gauss-Markov theorem shows that, when this 
is so, 0 is a best linear unbiased estimator (BLUE). If, 
however, the measurements are uncorrelated but have 
different uncertainties, a modified approach might be 
adopted. Aitken showed that when a weighted sum of 
squared residuals is minimized, (3 is the BLUE if each 
weight is equal to the reciprocal of the variance of the 
measurement 


1 

*=i ai 

The gradient equations for this sum of squares are 


X' = wX,y' = wy. 

For non-linear least squares systems a similar argument 
shows that the normal equations should be modified as 
follows. 

(J T WJ) A/3 = J T WAy. 

Note that for empirical tests, the appropriate W is not 
known for sure and must be estimated. For this feasible 
generalized least squares (FGLS) techniques may be 
used. 
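
As an illustration of the weighted normal equations $(X^T W X)\hat\beta = X^T W y$, the following sketch fits a straight line $y \approx b_0 + b_1 x$ with weights $W_{ii} = 1/\sigma_i^2$. The function name `weighted_line_fit` and the restriction to a two-parameter model are assumptions made here for brevity; they are not part of the text above.

```python
def weighted_line_fit(x, y, sigma):
    """Fit y ~ b0 + b1*x by minimizing sum(W_ii * r_i**2), W_ii = 1/sigma_i**2.

    Hypothetical helper: solves the 2x2 system (X^T W X) b = X^T W y
    for the design matrix X whose rows are (1, x_i).
    """
    w = [1.0 / s ** 2 for s in sigma]
    # Entries of X^T W X.
    s_w = sum(w)
    s_wx = sum(wi * xi for wi, xi in zip(w, x))
    s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    # Entries of X^T W y.
    s_wy = sum(wi * yi for wi, yi in zip(w, y))
    s_wxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    # Solve the 2x2 linear system by Cramer's rule.
    det = s_w * s_wxx - s_wx ** 2
    b0 = (s_wxx * s_wy - s_wx * s_wxy) / det
    b1 = (s_w * s_wxy - s_wx * s_wy) / det
    return b0, b1


# Points lying exactly on y = 1 + 2x are recovered for any positive sigmas.
b0, b1 = weighted_line_fit([0.0, 1.0, 2.0, 3.0],
                           [1.0, 3.0, 5.0, 7.0],
                           [0.1, 0.2, 0.1, 0.5])
```

Observations with a large $\sigma_i$ receive a small weight and therefore pull the fitted line less than precise observations do.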

16.7 Relationship to principal components

The first principal component about the mean of a set of points can be represented by that line which most closely approaches the data points (as measured by squared distance of closest approach, i.e. perpendicular to the line). In contrast, linear least squares tries to minimize the distance in the y direction only. Thus, although the two use a similar error metric, linear least squares is a method that treats one dimension of the data preferentially, while PCA treats all dimensions equally.


16.8 Regularized versions

16.8.1 Tikhonov regularization 

Main article: Tikhonov regularization 

In some contexts a regularized version of the least squares solution may be preferable. Tikhonov regularization (or ridge regression) adds a constraint that $\|\beta\|^2$, the $L^2$-norm of the parameter vector, is not greater than a given value. Equivalently, it may solve an unconstrained minimization of the least-squares penalty with $\alpha\|\beta\|^2$ added, where $\alpha$ is a constant (this is the Lagrangian form of the constrained problem). In a Bayesian context, this is equivalent to placing a zero-mean normally distributed prior on the parameter vector.


16.8.2 Lasso method 

An alternative regularized version of least squares is lasso (least absolute shrinkage and selection operator), which uses the constraint that $\|\beta\|_1$, the $L^1$-norm of the parameter vector, is no greater than a given value.[8][9][10] (As above, this is equivalent to an unconstrained minimization of the least-squares penalty with $\alpha\|\beta\|_1$ added.) In a Bayesian context, this is equivalent to placing a zero-mean Laplace prior distribution on the parameter vector.[11] The optimization problem may be solved using quadratic programming or more general convex optimization methods, as well as by specific algorithms such as the least angle regression algorithm.

One of the prime differences between lasso and ridge regression is that in ridge regression, as the penalty is increased, all parameters are reduced while still remaining non-zero, while in lasso, increasing the penalty will cause more and more of the parameters to be driven to zero. This is an advantage of lasso over ridge regression, as driving parameters to zero deselects the features from the regression. Thus, lasso automatically selects more relevant features and discards the others, whereas ridge regression never fully discards any features. Some feature selection techniques are developed based on the lasso, including Bolasso, which bootstraps samples,[12] and FeaLect, which analyzes the regression coefficients corresponding to different values of $\alpha$ to score all the features.[13]
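
The contrast between the two penalties can be seen in a deliberately simplified setting. The sketch below assumes a single coefficient whose unpenalized least-squares value is `b`, and minimizes $\tfrac12(\beta - b)^2$ plus the penalty; the function names and this one-dimensional reduction are illustrative assumptions, not a general solver.

```python
def ridge_1d(b, alpha):
    """Minimizer of 0.5*(beta - b)**2 + alpha*beta**2.

    Setting the derivative (beta - b) + 2*alpha*beta to zero gives a pure
    rescaling of b, which is never exactly zero unless b is zero.
    """
    return b / (1.0 + 2.0 * alpha)


def lasso_1d(b, alpha):
    """Minimizer of 0.5*(beta - b)**2 + alpha*abs(beta).

    This is the soft-thresholding rule: coefficients with |b| <= alpha
    are set exactly to zero, larger ones are shifted towards zero.
    """
    if b > alpha:
        return b - alpha
    if b < -alpha:
        return b + alpha
    return 0.0


small, large, alpha = 0.3, 2.0, 0.5
ridge_values = [ridge_1d(small, alpha), ridge_1d(large, alpha)]
lasso_values = [lasso_1d(small, alpha), lasso_1d(large, alpha)]
```

In this toy setting the small coefficient survives (shrunken) under the ridge penalty but is removed entirely by the lasso penalty, which mirrors the feature-selection behaviour described above.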

The $L^1$-regularized formulation is useful in some contexts due to its tendency to prefer solutions with fewer nonzero parameter values, effectively reducing the number of variables upon which the given solution is dependent.[8] For this reason, the lasso and its variants are fundamental to the field of compressed sensing. An extension of this approach is elastic net regularization.


16.9 See also 

• Adjustment of observations 

• Bayesian MMSE estimator 

• Best linear unbiased estimator (BLUE) 

• Best linear unbiased prediction (BLUP) 

• Gauss-Markov theorem 

• L2 norm

• Least absolute deviation 

• Measurement uncertainty 

• Proximal gradient methods for learning 

• Quadratic loss function 

• Root mean square 

• Squared deviations


Chapter 17

Newton’s method 


This article is about Newton’s method for finding roots. 
For Newton’s method for finding minima, see Newton’s 
method in optimization. 

In numerical analysis, Newton’s method (also known as the Newton-Raphson method), named after Isaac Newton and Joseph Raphson, is a method for finding successively better approximations to the roots (or zeroes) of a real-valued function.

$$x : f(x) = 0.$$

The Newton-Raphson method in one variable is implemented as follows:

Given a function $f$ defined over the reals $x$, and its derivative $f'$, we begin with a first guess $x_0$ for a root of the function $f$. Provided the function satisfies all the assumptions made in the derivation of the formula, a better approximation $x_1$ is

$$x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}.$$

Geometrically, $(x_1, 0)$ is the intersection with the x-axis of the tangent to the graph of $f$ at $(x_0, f(x_0))$.

The process is repeated as

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$

until a sufficiently accurate value is reached.

This algorithm is first in the class of Householder’s methods, succeeded by Halley’s method. The method can also be extended to complex functions and to systems of equations.


17.1 Description

[Figure: The function $f$ is shown in blue and the tangent line is in red. We see that $x_{n+1}$ is a better approximation than $x_n$ for the root $x$ of the function $f$.]

The idea of the method is as follows: one starts with an initial guess which is reasonably close to the true root, then the function is approximated by its tangent line (which can be computed using the tools of calculus), and one computes the x-intercept of this tangent line (which is easily done with elementary algebra). This x-intercept will typically be a better approximation to the function’s root than the original guess, and the method can be iterated.

Suppose $f : [a, b] \to \mathbb{R}$ is a differentiable function defined on the interval $[a, b]$ with values in the real numbers $\mathbb{R}$. The formula for converging on the root can be easily derived. Suppose we have some current approximation $x_n$. Then we can derive the formula for a better approximation, $x_{n+1}$, by referring to the figure. The equation of the tangent line to the curve $y = f(x)$ at the point $x = x_n$ is

$$y = f'(x_n)(x - x_n) + f(x_n),$$

where $f'$ denotes the derivative of the function $f$.

The x-intercept of this line (the value of $x$ such that $y = 0$) is then used as the next approximation to the root, $x_{n+1}$. In other words, setting $y$ to zero and $x$ to $x_{n+1}$ gives

$$0 = f'(x_n)(x_{n+1} - x_n) + f(x_n).$$

Solving for $x_{n+1}$ gives

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}.$$

We start the process off with some arbitrary initial value $x_0$. (The closer to the zero, the better. But, in the absence of any intuition about where the zero might lie, a “guess and check” method might narrow the possibilities to a reasonably small interval by appealing to the intermediate value theorem.) The method will usually converge, provided this initial guess is close enough to the unknown zero, and that $f'(x_0) \neq 0$. Furthermore, for a zero of multiplicity 1, the convergence is at least quadratic (see rate of convergence) in a neighbourhood of the zero, which intuitively means that the number of correct digits roughly at least doubles in every step. More details can be found in the analysis section below.
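
The iteration $x_{n+1} = x_n - f(x_n)/f'(x_n)$ can be sketched in a few lines. The function name `newton`, the stopping rule based on the step size, and the default tolerance and iteration limit are choices made for this illustration only.

```python
def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Iterate x_{n+1} = x_n - f(x_n)/f'(x_n) starting from x0.

    Stops when two successive iterates agree to within a relative
    tolerance; raises if the derivative vanishes or the limit is hit.
    """
    x = x0
    for _ in range(max_iter):
        dfx = fprime(x)
        if dfx == 0.0:
            raise ZeroDivisionError("derivative vanished at x = %r" % x)
        x_new = x - f(x) / dfx
        if abs(x_new - x) <= tol * max(1.0, abs(x_new)):
            return x_new
        x = x_new
    raise RuntimeError("no convergence within max_iter iterations")


# Positive root of x^2 - 2, starting from the guess x0 = 1.
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)
```

The explicit iteration limit matters in practice because, as later sections discuss, the method is not guaranteed to converge from every starting point.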

The Householder’s methods are similar but have higher order for even faster convergence. However, the extra computations required for each step can slow down the overall performance relative to Newton’s method, particularly if $f$ or its derivatives are computationally expensive to evaluate.


17.2 History 

The name “Newton’s method” is derived from Isaac Newton's description of a special case of the method in De analysi per aequationes numero terminorum infinitas (written in 1669, published in 1711 by William Jones) and in De metodis fluxionum et serierum infinitarum (written in 1671, translated and published as Method of Fluxions in 1736 by John Colson). However, his method differs substantially from the modern method given above: Newton applies the method only to polynomials. He does not compute the successive approximations $x_n$, but computes a sequence of polynomials, and only at the end arrives at an approximation for the root $x$. Finally, Newton views the method as purely algebraic and makes no mention of the connection with calculus. Newton may have derived his method from a similar but less precise method by Vieta. The essence of Vieta’s method can be found in the work of the Persian mathematician Sharaf al-Din al-Tusi, while his successor Jamshīd al-Kāshī used a form of Newton’s method to solve $x^P - N = 0$ to find roots of $N$ (Ypma 1995). A special case of Newton’s method for calculating square roots was known much earlier and is often called the Babylonian method.

Newton’s method was used by 17th-century Japanese mathematician Seki Kowa to solve single-variable equations, though the connection with calculus was missing.

Newton’s method was first published in 1685 in A Treatise of Algebra both Historical and Practical by John Wallis. In 1690, Joseph Raphson published a simplified description in Analysis aequationum universalis. Raphson again viewed Newton’s method purely as an algebraic method and restricted its use to polynomials, but he describes the method in terms of the successive approximations $x_n$ instead of the more complicated sequence of polynomials used by Newton. Finally, in 1740, Thomas Simpson described Newton’s method as an iterative method for solving general nonlinear equations using calculus, essentially giving the description above. In the same publication, Simpson also gives the generalization to systems of two equations and notes that Newton’s method can be used for solving optimization problems by setting the gradient to zero.

Arthur Cayley in 1879 in The Newton-Fourier imaginary problem was the first to notice the difficulties in generalizing Newton’s method to complex roots of polynomials with degree greater than 2 and complex initial values. This opened the way to the study of the theory of iterations of rational functions.


17.3 Practical considerations 

Newton’s method is an extremely powerful technique: in general the convergence is quadratic, so that as the method converges on the root, the difference between the root and the approximation is squared (the number of accurate digits roughly doubles) at each step. However, there are some difficulties with the method.


17.3.1 Difficulty in calculating derivative of a function

Newton’s method requires that the derivative be calculated directly. An analytical expression for the derivative may not be easily obtainable and could be expensive to evaluate. In these situations, it may be appropriate to approximate the derivative by using the slope of a line through two nearby points on the function. Using this approximation would result in something like the secant method, whose convergence is slower than that of Newton’s method.
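
A minimal sketch of that idea follows: the derivative is replaced by the slope of the line through the two most recent iterates, which is the secant method. The function name `secant`, the tolerance, and the test equation are assumptions chosen for illustration.

```python
def secant(f, x0, x1, tol=1e-12, max_iter=100):
    """Root finding with f'(x_n) replaced by a finite-difference slope.

    Uses the slope (f(x1) - f(x0)) / (x1 - x0) of the line through the
    two latest iterates in place of the analytical derivative.
    """
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if f1 == f0:
            raise ZeroDivisionError("secant line is horizontal")
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) <= tol * max(1.0, abs(x2)):
            return x2
        # Shift the window: keep only the two most recent points.
        x0, f0 = x1, f1
        x1, f1 = x2, f(x2)
    raise RuntimeError("no convergence within max_iter iterations")


# Real root of x^3 - 2, using only evaluations of f itself.
cube_root_of_two = secant(lambda x: x ** 3 - 2.0, 1.0, 2.0)
```

Each step costs one new function evaluation and no derivative evaluation, which is the trade-off against the slower convergence mentioned above.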


17.3.2 Failure of the method to converge to the root

It is important to review the proof of quadratic convergence of Newton’s method before implementing it. Specifically, one should review the assumptions made in the proof. For situations where the method fails to converge, it is because the assumptions made in this proof are not met.

Overshoot

If the first derivative is not well behaved in the neighborhood of a particular root, the method may overshoot, and diverge from that root. An example of a function with one root, for which the derivative is not well behaved in the neighborhood of the root, is

$$f(x) = |x|^a, \qquad 0 < a < \tfrac{1}{2}$$

for which the root will be overshot and the sequence of $x$ will diverge. For $a = 1/2$, the root will still be overshot, but the sequence will oscillate between two values. For $1/2 < a < 1$, the root will still be overshot but the sequence will converge, and for $a \geq 1$ the root will not be overshot at all.

In some cases, Newton’s method can be stabilized by using successive over-relaxation, or the speed of convergence can be increased by using the same method.

Stationary point 

If a stationary point of the function is encountered, the 
derivative is zero and the method will terminate due to 
division by zero. 

Poor initial estimate 

A large error in the initial estimate can contribute to non-convergence of the algorithm.

Mitigation of non-convergence

In a robust implementation of Newton’s method, it is common to place limits on the number of iterations, bound the solution to an interval known to contain the root, and combine the method with a more robust root finding method.
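
One way these three safeguards might be combined is sketched below: the iteration count is capped, a bracketing interval with a sign change is maintained, and a bisection step is taken whenever the Newton step would leave the bracket. The function name `safeguarded_newton` and the specific fallback rule are assumptions for this example, not a prescribed algorithm.

```python
def safeguarded_newton(f, fprime, lo, hi, tol=1e-12, max_iter=100):
    """Newton's method confined to a bracket [lo, hi] with f(lo)*f(hi) < 0.

    Falls back to bisection when the derivative is zero or the Newton
    step would land outside the current bracket.
    """
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0.0:
        return lo
    if f_hi == 0.0:
        return hi
    if f_lo * f_hi > 0.0:
        raise ValueError("the interval does not bracket a sign change")
    x = 0.5 * (lo + hi)
    for _ in range(max_iter):
        fx = f(x)
        if fx == 0.0:
            return x
        # Shrink the bracket so that it still contains a sign change.
        if f_lo * fx < 0.0:
            hi = x
        else:
            lo, f_lo = x, fx
        dfx = fprime(x)
        x_new = x - fx / dfx if dfx != 0.0 else None
        if x_new is None or not (lo <= x_new <= hi):
            x_new = 0.5 * (lo + hi)  # bisection fallback
        if abs(x_new - x) <= tol * max(1.0, abs(x_new)):
            return x_new
        x = x_new
    raise RuntimeError("no convergence within max_iter iterations")


# x^3 - 2x + 2 has its real root in [-3, 0].
cubic = lambda x: x ** 3 - 2.0 * x + 2.0
root = safeguarded_newton(cubic, lambda x: 3.0 * x * x - 2.0, -3.0, 0.0)
```

Because the bracket always contains a sign change, the fallback can only be taken finitely often before the Newton steps stay inside it.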

17.3.3 Slow convergence for roots of multiplicity > 1

If the root being sought has multiplicity greater than one, the convergence rate is merely linear (errors reduced by a constant factor at each step) unless special steps are taken. When there are two or more roots that are close together then it may take many iterations before the iterates get close enough to one of them for the quadratic convergence to be apparent. However, if the multiplicity $m$ of the root is known, one can use the following modified algorithm that preserves the quadratic convergence rate:

$$x_{n+1} = x_n - m\,\frac{f(x_n)}{f'(x_n)}.$$

This is equivalent to using successive over-relaxation. On the other hand, if the multiplicity $m$ of the root is not known, it is possible to estimate $m$ after carrying out one or two iterations, and then use that value to increase the rate of convergence.
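
The effect of the factor $m$ can be checked on a small example. The sketch below uses the polynomial $(x-1)^2(x+2)$, which has a double root at 1; the function name `newton_with_multiplicity`, the stopping rule, and the test polynomial are assumptions chosen for illustration.

```python
def newton_with_multiplicity(f, fprime, x0, m=1, tol=1e-10, max_iter=200):
    """Iterate x_{n+1} = x_n - m*f(x_n)/f'(x_n); return (root, iterations)."""
    x = x0
    for n in range(1, max_iter + 1):
        fx, dfx = f(x), fprime(x)
        if fx == 0.0:
            return x, n - 1
        if dfx == 0.0:
            raise ZeroDivisionError("derivative vanished at x = %r" % x)
        step = m * fx / dfx
        x -= step
        if abs(step) <= tol:
            return x, n
    raise RuntimeError("no convergence within max_iter iterations")


# f(x) = (x - 1)^2 (x + 2), with f'(x) = 3 (x - 1)(x + 1).
f = lambda x: (x - 1.0) ** 2 * (x + 2.0)
fprime = lambda x: 3.0 * (x - 1.0) * (x + 1.0)

plain_root, plain_steps = newton_with_multiplicity(f, fprime, 2.0, m=1)
fast_root, fast_steps = newton_with_multiplicity(f, fprime, 2.0, m=2)
```

With `m=1` the error is roughly halved at each step, so many iterations are needed; with `m=2` the step count drops sharply, in line with the restored quadratic rate.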

17.4 Analysis 

Suppose that the function $f$ has a zero at $\alpha$, i.e., $f(\alpha) = 0$, and $f$ is differentiable in a neighborhood of $\alpha$.

If $f$ is continuously differentiable and its derivative is nonzero at $\alpha$, then there exists a neighborhood of $\alpha$ such that for all starting values $x_0$ in that neighborhood, the sequence $\{x_n\}$ will converge to $\alpha$.[2]

If the function is continuously differentiable and its derivative is not 0 at $\alpha$ and it has a second derivative at $\alpha$ then the convergence is quadratic or faster. If the second derivative is not 0 at $\alpha$ then the convergence is merely quadratic. If the third derivative exists and is bounded in a neighborhood of $\alpha$, then:

$$\Delta x_{i+1} = \frac{f''(\alpha)}{2 f'(\alpha)} (\Delta x_i)^2 + O[\Delta x_i]^3,$$

where $\Delta x_i \triangleq x_i - \alpha$.

If the derivative is 0 at $\alpha$, then the convergence is usually only linear. Specifically, if $f$ is twice continuously differentiable, $f'(\alpha) = 0$ and $f''(\alpha) \neq 0$, then there exists a neighborhood of $\alpha$ such that for all starting values $x_0$ in that neighborhood, the sequence of iterates converges linearly, with rate $\log_{10} 2$ (Süli & Mayers, Exercise 1.6). Alternatively if $f'(\alpha) = 0$ and $f'(x) \neq 0$ for $x \neq \alpha$, $x$ in a neighborhood $U$ of $\alpha$, $\alpha$ being a zero of multiplicity $r$, and if $f \in C^r(U)$ then there exists a neighborhood of $\alpha$ such that for all starting values $x_0$ in that neighborhood, the sequence of iterates converges linearly.

However, even linear convergence is not guaranteed in pathological situations.

In practice these results are local, and the neighborhood of convergence is not known in advance. But there are also some results on global convergence: for instance, given a right neighborhood $U_+$ of $\alpha$, if $f$ is twice differentiable in $U_+$ and if $f' \neq 0$, $f \cdot f'' > 0$ in $U_+$, then, for each $x_0$ in $U_+$ the sequence $x_k$ is monotonically decreasing to $\alpha$.

17.4.1 Proof of quadratic convergence for Newton’s iterative method

According to Taylor’s theorem, any function $f(x)$ which has a continuous second derivative can be represented by an expansion about a point that is close to a root of $f(x)$. Suppose this root is $\alpha$. Then the expansion of $f(\alpha)$ about $x_n$ is:

$$f(\alpha) = f(x_n) + f'(x_n)(\alpha - x_n) + R_1 \qquad (1)$$

where the Lagrange form of the Taylor series expansion remainder is

$$R_1 = \frac{1}{2!} f''(\xi_n)(\alpha - x_n)^2,$$

where $\xi_n$ is in between $x_n$ and $\alpha$.

Since $\alpha$ is the root, (1) becomes:

$$0 = f(\alpha) = f(x_n) + f'(x_n)(\alpha - x_n) + \frac{1}{2} f''(\xi_n)(\alpha - x_n)^2 \qquad (2)$$

Dividing equation (2) by $f'(x_n)$ and rearranging gives

$$\frac{f(x_n)}{f'(x_n)} + (\alpha - x_n) = \frac{-f''(\xi_n)}{2 f'(x_n)} (\alpha - x_n)^2 \qquad (3)$$

Remembering that $x_{n+1}$ is defined by

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}, \qquad (4)$$

one finds that

$$\underbrace{\alpha - x_{n+1}}_{\varepsilon_{n+1}} = \frac{-f''(\xi_n)}{2 f'(x_n)} (\underbrace{\alpha - x_n}_{\varepsilon_n})^2.$$

That is,

$$\varepsilon_{n+1} = \frac{-f''(\xi_n)}{2 f'(x_n)}\, \varepsilon_n^2. \qquad (5)$$

Taking absolute value of both sides gives

$$|\varepsilon_{n+1}| = \frac{|f''(\xi_n)|}{2\,|f'(x_n)|}\, \varepsilon_n^2. \qquad (6)$$

Equation (6) shows that the rate of convergence is quadratic if the following conditions are satisfied:

1. $f'(x) \neq 0$ for all $x \in I$, where $I$ is the interval $[\alpha - r, \alpha + r]$ for some $r \geq |\alpha - x_0|$;

2. $f''(x)$ is finite for all $x \in I$;

3. $x_0$ is sufficiently close to the root $\alpha$.

The term sufficiently close in this context means the following:

(a) Taylor approximation is accurate enough such that we can ignore higher order terms;

(b) $\frac{1}{2}\left|\frac{f''(x_n)}{f'(x_n)}\right| < C \left|\frac{f''(\alpha)}{f'(\alpha)}\right|$, for some $C < \infty$;

(c) $C \left|\frac{f''(\alpha)}{f'(\alpha)}\right| \varepsilon_n < 1$, for $n \in \mathbb{Z}^+ \cup \{0\}$ and $C$ satisfying condition (b).

Finally, (6) can be expressed in the following way:

$$|\varepsilon_{n+1}| \leq M \varepsilon_n^2$$

where $M$ is the supremum of the variable coefficient of $\varepsilon_n^2$ on the interval $I$ defined in condition 1, that is:

$$M = \sup_{x \in I} \frac{1}{2} \left|\frac{f''(x)}{f'(x)}\right|.$$

The initial point $x_0$ has to be chosen such that conditions 1 through 3 are satisfied, where the third condition requires that $M |\varepsilon_0| < 1$.


17.4.2 Basins of attraction

The basins of attraction (the regions of the real number line such that within each region iteration from any point leads to one particular root) can be infinite in number and arbitrarily small. For example,[3] for the function $f(x) = x^3 - 2x^2 - 11x + 12$, the following initial conditions are in successive basins of attraction:

2.35287527 converges to 4;
2.35284172 converges to -3;
2.35283735 converges to 4;
2.352836327 converges to -3;
2.352836323 converges to 1.


17.5 Failure analysis

Newton’s method is only guaranteed to converge if certain conditions are satisfied. If the assumptions made in the proof of quadratic convergence are met, the method will converge. For the following subsections, failure of the method to converge indicates that the assumptions made in the proof were not met.


17.5.1 Bad starting points

In some cases the conditions on the function that are necessary for convergence are satisfied, but the point chosen as the initial point is not in the interval where the method converges. This can happen, for example, if the function whose root is sought approaches zero asymptotically as $x$ goes to $\infty$ or $-\infty$. In such cases a different method, such as bisection, should be used to obtain a better estimate for the zero to use as an initial point.

Iteration point is stationary

Consider the function:

$$f(x) = 1 - x^2.$$

It has a maximum at $x = 0$ and solutions of $f(x) = 0$ at $x = \pm 1$. If we start iterating from the stationary point $x_0 = 0$ (where the derivative is zero), $x_1$ will be undefined, since the tangent at (0,1) is parallel to the x-axis:

$$x_1 = x_0 - \frac{f(x_0)}{f'(x_0)} = 0 - \frac{1}{0}.$$

The same issue occurs if, instead of the starting point, any iteration point is stationary. Even if the derivative is small but not zero, the next iteration will be a far worse approximation.

Starting point enters a cycle

[Figure: The tangent lines of $x^3 - 2x + 2$ at 0 and 1 intersect the x-axis at 1 and 0 respectively, illustrating why Newton’s method oscillates between these values for some starting points.]

For some functions, some starting points may enter an infinite cycle, preventing convergence. Let

$$f(x) = x^3 - 2x + 2$$

and take 0 as the starting point. The first iteration produces 1 and the second iteration returns to 0 so the sequence will alternate between the two without converging to a root. In fact, this 2-cycle is stable: there are neighborhoods around 0 and around 1 from which all points iterate asymptotically to the 2-cycle (and hence not to the root of the function). In general, the behavior of the sequence can be very complex (see Newton fractal).


17.5.2 Derivative issues

If the function is not continuously differentiable in a neighborhood of the root then it is possible that Newton’s method will always diverge and fail, unless the solution is guessed on the first try.

Derivative does not exist at root

A simple example of a function where Newton’s method diverges is the cube root, which is continuous and infinitely differentiable, except for $x = 0$, where its derivative is undefined (this, however, does not affect the algorithm, since it will never require the derivative if the solution is already found):

$$f(x) = \sqrt[3]{x}.$$

For any iteration point $x_n$, the next iteration point will be:

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} = x_n - \frac{x_n^{1/3}}{\frac{1}{3} x_n^{\frac{1}{3} - 1}} = x_n - 3x_n = -2x_n.$$

The algorithm overshoots the solution and lands on the other side of the y-axis, farther away than it initially was; applying Newton’s method actually doubles the distances from the solution at each iteration.

In fact, the iterations diverge to infinity for every $f(x) = |x|^a$, where $0 < a < \tfrac{1}{2}$. In the limiting case of $a = \tfrac{1}{2}$ (square root), the iterations will alternate indefinitely between points $x_0$ and $-x_0$, so they do not converge in this case either.

Discontinuous derivative

If the derivative is not continuous at the root, then convergence may fail to occur in any neighborhood of the root. Consider the function

$$f(x) = \begin{cases} 0 & \text{if } x = 0, \\ x + x^2 \sin\left(\frac{2}{x}\right) & \text{if } x \neq 0. \end{cases}$$

Its derivative is:

$$f'(x) = \begin{cases} 1 & \text{if } x = 0, \\ 1 + 2x \sin\left(\frac{2}{x}\right) - 2 \cos\left(\frac{2}{x}\right) & \text{if } x \neq 0. \end{cases}$$

Within any neighborhood of the root, this derivative keeps changing sign as $x$ approaches 0 from the right (or from the left) while $f(x) \geq x - x^2 > 0$ for $0 < x < 1$.

So $f(x)/f'(x)$ is unbounded near the root, and Newton’s method will diverge almost everywhere in any neighborhood of it, even though:

• the function is differentiable (and thus continuous) everywhere;

• the derivative at the root is nonzero;

• $f$ is infinitely differentiable except at the root; and

• the derivative is bounded in a neighborhood of the root (unlike $f(x)/f'(x)$).

17.5.3 Non-quadratic convergence

In some cases the iterates converge but do not converge as quickly as promised. In these cases simpler methods converge just as quickly as Newton’s method.

Zero derivative

If the first derivative is zero at the root, then convergence will not be quadratic. Indeed, let

$$f(x) = x^2$$

then $f'(x) = 2x$ and consequently $x - f(x)/f'(x) = x/2$. So convergence is not quadratic, even though the function is infinitely differentiable everywhere.

Similar problems occur even when the root is only “nearly” double. For example, let

$$f(x) = x^2 (x - 1000) + 1.$$

Then the first few iterates starting at $x_0 = 1$ are 1, 0.500250376, 0.251062828, 0.127507934, 0.067671976, 0.041224176, 0.032741218, 0.031642362; it takes six iterations to reach a point where the convergence appears to be quadratic.

No second derivative

If there is no second derivative at the root, then convergence may fail to be quadratic. Indeed, let

$$f(x) = x + x^{4/3}.$$

Then

$$f'(x) = 1 + \tfrac{4}{3} x^{1/3}.$$

And

$$f''(x) = \tfrac{4}{9} x^{-2/3}$$

except when $x = 0$ where it is undefined. Given $x_n$,

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} = \frac{\frac{1}{3} x_n^{4/3}}{1 + \frac{4}{3} x_n^{1/3}}$$

which has approximately 4/3 times as many bits of precision as $x_n$ has. This is less than the 2 times as many which would be required for quadratic convergence. So the convergence of Newton’s method (in this case) is not quadratic, even though: the function is continuously differentiable everywhere; the derivative is not zero at the root; and $f$ is infinitely differentiable except at the desired root.
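
Two of the failure modes from section 17.5 are easy to reproduce numerically. The sketch below records raw Newton iterates without any safeguards; the helper name `newton_iterates` and the sign-aware cube root `cbrt` are assumptions introduced for this example.

```python
import math


def newton_iterates(f, fprime, x0, steps):
    """Return [x0, x1, ..., x_steps] produced by plain Newton iteration."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - f(x) / fprime(x))
    return xs


# 2-cycle: f(x) = x^3 - 2x + 2 started at 0 alternates between 0 and 1.
cycle = newton_iterates(lambda x: x ** 3 - 2.0 * x + 2.0,
                        lambda x: 3.0 * x * x - 2.0,
                        0.0, 4)


def cbrt(x):
    """Real cube root, defined for negative arguments as well."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


def cbrt_prime(x):
    """Derivative of the real cube root for x != 0."""
    return abs(x) ** (-2.0 / 3.0) / 3.0


# Divergence: each step maps x_n to approximately -2 * x_n.
diverging = newton_iterates(cbrt, cbrt_prime, 0.1, 5)
```

In the first case the iterates never approach the real root of the cubic; in the second, the distance from the root at 0 doubles at every step, as derived above.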


17.6 Generalizations 


17.6.1 Complex functions 



[Figure: Basins of attraction for $x^5 - 1 = 0$; darker means more iterations to converge.]

Main article: Newton fractal

When dealing with complex functions, Newton’s method can be directly applied to find their zeroes. Each zero has a basin of attraction in the complex plane, the set of all starting values that cause the method to converge to that particular zero. These sets can be mapped as in the image shown. For many complex functions, the boundaries of the basins of attraction are fractals.

In some cases there are regions in the complex plane which are not in any of these basins of attraction, meaning the iterates do not converge. For example,[4] if one uses a real initial condition to seek a root of $x^2 + 1$, all subsequent iterates will be real numbers and so the iterations cannot converge to either root, since both roots are non-real. In this case almost all real initial conditions lead to chaotic behavior, while some initial conditions iterate either to infinity or to repeating cycles of any finite length.

17.6.2 Nonlinear systems of equations 

k variables, k functions 

One may also use Newton’s method to solve systems of $k$ (non-linear) equations, which amounts to finding the zeroes of continuously differentiable functions $F : \mathbb{R}^k \to \mathbb{R}^k$. In the formulation given above, one then has to left multiply with the inverse of the $k$-by-$k$ Jacobian matrix $J_F(x_n)$ instead of dividing by $f'(x_n)$.

Rather than actually computing the inverse of this matrix, one can save time by solving the system of linear equations

$$J_F(x_n)(x_{n+1} - x_n) = -F(x_n)$$

for the unknown $x_{n+1} - x_n$.

k variables, m equations, with m > k

The $k$-dimensional Newton’s method can be used to solve systems of more than $k$ (non-linear) equations as well if the algorithm uses the generalized inverse of the non-square Jacobian matrix, $J^+ = (J^T J)^{-1} J^T$, instead of the inverse of $J$. If the nonlinear system has no solution, the method attempts to find a solution in the non-linear least squares sense. See Gauss-Newton algorithm for more information.
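
For the square case ($k$ equations in $k$ unknowns), the linear solve $J_F(x_n)(x_{n+1} - x_n) = -F(x_n)$ can be sketched for $k = 2$ using Cramer's rule. The function name `newton_2d`, the test system, and the starting point are assumptions chosen to keep the example dependency-free.

```python
def newton_2d(F, J, x, y, tol=1e-12, max_iter=50):
    """Newton's method for two equations in two unknowns.

    F(x, y) returns (f1, f2); J(x, y) returns the Jacobian rows
    ((df1/dx, df1/dy), (df2/dx, df2/dy)).
    """
    for _ in range(max_iter):
        f1, f2 = F(x, y)
        (a, b), (c, d) = J(x, y)
        det = a * d - b * c
        if det == 0.0:
            raise ZeroDivisionError("singular Jacobian")
        # Solve J * (dx, dy) = -(f1, f2) without forming the inverse.
        dx = (-f1 * d + b * f2) / det
        dy = (-a * f2 + c * f1) / det
        x, y = x + dx, y + dy
        if max(abs(dx), abs(dy)) <= tol:
            return x, y
    raise RuntimeError("no convergence within max_iter iterations")


# Intersection of the circle x^2 + y^2 = 4 with the line x = y.
F = lambda x, y: (x * x + y * y - 4.0, x - y)
J = lambda x, y: ((2.0 * x, 2.0 * y), (1.0, -1.0))
sol_x, sol_y = newton_2d(F, J, 1.0, 2.0)
```

For larger $k$ one would replace Cramer's rule by a general linear solver, but the structure of the iteration stays the same.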

17.6.3 Nonlinear equations in a Banach space

Another generalization is Newton’s method to find a root of a functional $F$ defined in a Banach space. In this case the formulation is

$$X_{n+1} = X_n - [F'(X_n)]^{-1} F(X_n),$$

where $F'(X_n)$ is the Fréchet derivative computed at $X_n$. One needs the Fréchet derivative to be boundedly invertible at each $X_n$ in order for the method to be applicable. A condition for existence of and convergence to a root is given by the Newton-Kantorovich theorem.

17.6.4 Nonlinear equations over p-adic numbers

In p-adic analysis, the standard method to show a polynomial equation in one variable has a p-adic root is Hensel’s lemma, which uses the recursion from Newton’s method on the p-adic numbers. Because of the more stable behavior of addition and multiplication in the p-adic numbers compared to the real numbers (specifically, the unit ball in the p-adics is a ring), convergence in Hensel’s lemma can be guaranteed under much simpler hypotheses than in the classical Newton’s method on the real line.


17.6.5 Newton-Fourier method 

The Newton-Fourier method is Joseph Fourier's extension of Newton’s method to provide bounds on the absolute error of the root approximation, while still providing quadratic convergence.

Assume that $f(x)$ is twice continuously differentiable on $[a, b]$ and that $f$ has a root in this interval. Assume that $f'(x), f''(x) \neq 0$ on this interval (this is the case for instance if $f(a) < 0$, $f(b) > 0$, and $f'(x) > 0$, and $f''(x) > 0$ on this interval). This guarantees that there is a unique root on this interval, call it $\alpha$. If it is concave down instead of concave up then replace $f(x)$ by $-f(x)$ since they have the same roots.

Let $x_0 = b$ be the right endpoint of the interval and let $z_0 = a$ be the left endpoint of the interval. Given $x_n$, define $x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$, which is just Newton’s method as before. Then define $z_{n+1} = z_n - \frac{f(z_n)}{f'(x_n)}$ and note that the denominator has $f'(x_n)$ and not $f'(z_n)$. The iterates $x_n$ will be strictly decreasing to the root while the iterates $z_n$ will be strictly increasing to the root. Also, $\lim_{n \to \infty} \frac{x_{n+1} - z_{n+1}}{(x_n - z_n)^2} = \frac{f''(\alpha)}{2 f'(\alpha)}$, so that the distance between $x_n$ and $z_n$ decreases quadratically.
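
The two coupled sequences can be sketched directly from these definitions. The function name `newton_fourier` and the test problem $f(x) = x^2 - 2$ on $[1, 2]$ (where $f(1) < 0$, $f(2) > 0$, $f' > 0$ and $f'' > 0$) are assumptions chosen for illustration.

```python
def newton_fourier(f, fprime, a, b, steps):
    """Return the list of brackets (z_n, x_n) for n = 0, ..., steps.

    x starts at the right endpoint and follows Newton's method; z starts
    at the left endpoint and uses the same denominator f'(x_n).
    """
    x, z = b, a
    brackets = [(z, x)]
    for _ in range(steps):
        d = fprime(x)  # both updates divide by f'(x_n), not f'(z_n)
        x, z = x - f(x) / d, z - f(z) / d
        brackets.append((z, x))
    return brackets


brackets = newton_fourier(lambda x: x * x - 2.0, lambda x: 2.0 * x,
                          1.0, 2.0, 4)
widths = [x - z for z, x in brackets]
```

Each pair encloses the root, so the width $x_n - z_n$ serves as an error bound for either endpoint, and the list `widths` shrinks rapidly from one step to the next.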


17.6.6 Quasi-Newton methods 

When the Jacobian is unavailable or too expensive to 
compute at every iteration, a Quasi-Newton method can 
be used. 


17.7 Applications 

17.7.1 Minimization and maximization problems

Main article: Newton’s method in optimization

Newton’s method can be used to find a minimum or maximum of a function. The derivative is zero at a minimum or maximum, so minima and maxima can be found by applying Newton’s method to the derivative. The iteration becomes:

$$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}.$$

17.7.2 Multiplicative inverses of numbers and power series

An important application is Newton-Raphson division, which can be used to quickly find the reciprocal of a number, using only multiplication and subtraction.

Finding the reciprocal of $a$ amounts to finding the root of the function

$$f(x) = a - \frac{1}{x}.$$

Newton’s iteration is

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} = x_n - \frac{a - \frac{1}{x_n}}{\frac{1}{x_n^2}} = x_n (2 - a x_n).$$

Therefore, Newton’s iteration needs only two multiplications and one subtraction.

This method is also very efficient to compute the multiplicative inverse of a power series.


17.7.3 Solving transcendental equations

Many transcendental equations can be solved using Newton’s method. Given the equation

$$g(x) = h(x),$$

with $g(x)$ and/or $h(x)$ a transcendental function, one writes

$$f(x) = g(x) - h(x).$$

The values of $x$ that solve the original equation are then the roots of $f(x)$, which may be found via Newton’s method.


17.8 Examples

For example, if one wishes to find the square root of 612, this is equivalent to finding the solution to

$$x^2 = 612.$$

The function to use in Newton’s method is then

$$f(x) = x^2 - 612$$

with derivative

$$f'(x) = 2x.$$

With an initial guess of 10, the sequence given by Newton’s method begins

$$x_1 = x_0 - \frac{f(x_0)}{f'(x_0)} = 10 - \frac{10^2 - 612}{2 \cdot 10} = 35.6$$

$$x_2 = x_1 - \frac{f(x_1)}{f'(x_1)} = 35.6 - \frac{35.6^2 - 612}{2 \cdot 35.6}$$

and continues in the same way through $x_5$. With only a few iterations one can obtain a solution accurate to many decimal places.


17.8.2 Solution of cos(x) = x³

Consider the problem of finding the positive number $x$ with $\cos(x) = x^3$. We can rephrase that as finding the zero of $f(x) = \cos(x) - x^3$. We have $f'(x) = -\sin(x) - 3x^2$. Since $\cos(x) \leq 1$ for all $x$ and $x^3 > 1$ for $x > 1$, we know that our solution lies between 0 and 1. We try a starting value of $x_0 = 0.5$. (Note that a starting value of 0 will lead to an undefined result, showing the importance of using a starting point that is close to the solution.)

$$x_1 = x_0 - \frac{f(x_0)}{f'(x_0)} = 0.5 - \frac{\cos(0.5) - (0.5)^3}{-\sin(0.5) - 3(0.5)^2}$$

$$x_2 = x_1 - \frac{f(x_1)}{f'(x_1)}$$

and so on through $x_6$.


17.8.1 Square root of a number 

Consider the problem of finding the square root of a num¬ 
ber. Newton’s method is one of many methods of com¬ 
puting square roots. 


The correct digits are underlined in the above example. In 
particular, xq is correct to the number of decimal places 
given. We see that the number of correct digits after the 
decimal point increases from 2 (forxa) to 5 and 10, illus¬ 
trating the quadratic convergence. 


35.6 

26.39550561797 

24.79063549245 

24.7386 8829407 

24.73863375376 


= 1.112141 

= 0.909672 

= 0.867263 
= 0.86547 7 
= 0.865474 
= 0.865474 
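The reciprocal iteration and the two worked examples of this section can be reproduced with a short Python sketch. The helper names `newton` and `reciprocal`, and the choice of $a = 7$ with starting guess 0.1 for the reciprocal, are illustrative assumptions; the square-root and cosine runs use the functions and starting values given in the text.

```python
import math


def newton(f, fprime, x0, steps):
    """Return the Newton iterates [x0, x1, ..., x_steps]."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - f(x) / fprime(x))
    return xs


def reciprocal(a, x0, steps):
    """Approximate 1/a with x <- x * (2 - a * x): no division is used."""
    x = x0
    for _ in range(steps):
        x = x * (2 - a * x)
    return x


# Square root of 612, starting from 10.
sqrt_iterates = newton(lambda x: x * x - 612, lambda x: 2 * x, 10.0, 5)

# Positive solution of cos(x) = x^3, starting from 0.5.
cos_iterates = newton(lambda x: math.cos(x) - x**3,
                      lambda x: -math.sin(x) - 3 * x * x, 0.5, 6)

# Reciprocal of 7; the guess 0.1 is an arbitrary value in (0, 2/7).
inv7 = reciprocal(7.0, 0.1, 6)

for n, x in enumerate(sqrt_iterates):
    print(n, x)
```

The reciprocal iteration converges only when the starting guess lies between 0 and $2/a$, which is why the guess is called out in the comment.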




















17.9 Pseudocode 

The following is an example of using Newton’s method to find a root of a function f which has derivative fprime.

The initial guess will be $x_0 = 1$ and the function will be $f(x) = x^2 - 2$, so that $f'(x) = 2x$.

Each new iterate of Newton’s method will be denoted by x1. During the computation we check whether the denominator (yprime) becomes too small (smaller than epsilon), which would be the case if $f'(x_n) \approx 0$, since otherwise a large amount of error could be introduced.

    % These choices depend on the problem being solved
    x0 = 1;                       % The initial value
    f = @(x) x^2 - 2;             % The function whose root we are trying to find
    fprime = @(x) 2*x;            % The derivative of f(x)
    tolerance = 10^(-7);          % 7 digit accuracy is desired
    epsilon = 10^(-14);           % Don't want to divide by a number smaller than this
    maxIterations = 20;           % Don't allow the iterations to continue indefinitely
    haveWeFoundSolution = false;  % Have not converged to a solution yet

    for i = 1 : maxIterations
        y = f(x0);
        yprime = fprime(x0);

        if (abs(yprime) < epsilon)   % Don't want to divide by too small of a number
            break;                   % Denominator is too small, so leave the loop
        end

        x1 = x0 - y/yprime;          % Do Newton's computation

        if (abs(x1 - x0)/abs(x1) < tolerance)   % If the result is within the desired tolerance
            haveWeFoundSolution = true;
            break;                   % Done, so leave the loop
        end

        x0 = x1;                     % Update x0 to start the process again
    end

    if (haveWeFoundSolution)
        ... % x1 is a solution within tolerance and maximum number of iterations
    else
        ... % did not converge
    end


17.10 See also 

• Aitken’s delta-squared process 

• Bisection method 

• Euler method 

• Fast inverse square root 

• Fisher scoring 

• Gradient descent 

• Integer square root 

• Laguerre’s method 

• Leonid Kantorovich, who initiated the convergence 
analysis of Newton’s method in Banach spaces. 

• Methods of computing square roots 

• Newton’s method in optimization 

• Richardson extrapolation 


• Root-finding algorithm 

• Secant method 

• Steffensen’s method 

• Subgradient method 
Chapter 18 

Supervised learning 


See also: Unsupervised learning 

Supervised learning is the machine learning task of inferring a function from labeled training data.[1] The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. In the optimal scenario, the algorithm correctly determines the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a “reasonable” way (see inductive bias).

The parallel task in human and animal psychology is often 
referred to as concept learning. 

18.1 Overview 

In order to solve a given problem of supervised learning, 
one has to perform the following steps: 

1. Determine the type of training examples. Before doing anything else, the user should decide what kind of data is to be used as a training set. In the case of handwriting analysis, for example, this might be a single handwritten character, an entire handwritten word, or an entire line of handwriting.

2. Gather a training set. The training set needs to be representative of the real-world use of the function. Thus, a set of input objects is gathered and corresponding outputs are also gathered, either from human experts or from measurements.

3. Determine the input feature representation of the learned function. The accuracy of the learned function depends strongly on how the input object is represented. Typically, the input object is transformed into a feature vector, which contains a number of features that are descriptive of the object. The number of features should not be too large, because of the curse of dimensionality, but should contain enough information to accurately predict the output.

4. Determine the structure of the learned function and corresponding learning algorithm. For example, the engineer may choose to use support vector machines or decision trees.

5. Complete the design. Run the learning algorithm on the gathered training set. Some supervised learning algorithms require the user to determine certain control parameters. These parameters may be adjusted by optimizing performance on a subset (called a validation set) of the training set, or via cross-validation.

6. Evaluate the accuracy of the learned function. After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.
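The steps above can be sketched end to end in plain Python. Everything specific here is an assumption made for illustration: the synthetic two-cluster data, the k-nearest-neighbour learner, the 30/10/10 split, and the candidate values of k.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(
        train, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x))
    )[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


def accuracy(train, examples, k):
    """Fraction of examples whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in examples)
    return hits / len(examples)


# Steps 1-3: labelled feature vectors (two synthetic, well-separated clusters).
data = [((i * 0.1, j * 0.1), 0) for i in range(5) for j in range(5)]
data += [((3 + i * 0.1, 3 + j * 0.1), 1) for i in range(5) for j in range(5)]
random.Random(0).shuffle(data)

# Hold out validation and test sets that the learner never trains on.
train, validation, test = data[:30], data[30:40], data[40:]

# Steps 4-5: pick the control parameter k using the validation set.
best_k = max((1, 3, 5), key=lambda k: accuracy(train, validation, k))

# Step 6: report accuracy on the untouched test set.
test_accuracy = accuracy(train, test, best_k)
print(best_k, test_accuracy)
```

Because the clusters are far apart, this toy problem is trivially easy; the point is only the separation of training, validation, and test data.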

A wide range of supervised learning algorithms is available, each with its strengths and weaknesses. There is no single learning algorithm that works best on all supervised learning problems (see the No free lunch theorem).

There are four major issues to consider in supervised 
learning: 

18.1.1 Bias-variance tradeoff 

Main article: Bias-variance dilemma 

A first issue is the tradeoff between bias and variance.[2] Imagine that we have available several different, but equally good, training data sets. A learning algorithm is biased for a particular input x if, when trained on each of these data sets, it is systematically incorrect when predicting the correct output for x. A learning algorithm has high variance for a particular input x if it predicts different output values when trained on different training sets. The prediction error of a learned classifier is related to the sum of the bias and the variance of the learning algorithm.[3] Generally, there is a tradeoff between bias and variance. A learning algorithm with low bias must be “flexible” so that it can fit the data well. But if the learning algorithm is too flexible, it will fit each training data set differently, and hence have high variance. A key aspect of many supervised learning methods is that they are able to adjust this tradeoff between bias and variance (either automatically or by providing a bias/variance parameter that the user can adjust).
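The thought experiment with several training sets can be simulated directly. The sketch below is only an illustration under stated assumptions: the "true" function $x^2$, the noise level, the query input 0.9, and the two toy learners (one that always predicts the mean label, one that copies the nearest neighbour) are all invented for the example.

```python
import random
import statistics

rng = random.Random(0)
xs = [i / 19 for i in range(20)]  # fixed inputs shared by every training set
query = 0.9                       # the particular input x being studied


def true_f(x):
    """Assumed 'true' function generating the labels."""
    return x * x


def sample_training_set():
    """One training set: the true function plus Gaussian label noise."""
    return [(x, true_f(x) + rng.gauss(0, 0.3)) for x in xs]


def predict_mean(train, x):
    """Inflexible learner: ignores x and predicts the average label."""
    return statistics.mean([y for _, y in train])


def predict_1nn(train, x):
    """Flexible learner: copies the label of the nearest training input."""
    return min(train, key=lambda ex: abs(ex[0] - x))[1]


def bias_and_variance(predict, trials=500):
    """Estimate bias and variance at `query` over many training sets."""
    preds = [predict(sample_training_set(), query) for _ in range(trials)]
    return statistics.mean(preds) - true_f(query), statistics.pvariance(preds)


bias_mean, var_mean = bias_and_variance(predict_mean)
bias_1nn, var_1nn = bias_and_variance(predict_1nn)
print(bias_mean, var_mean, bias_1nn, var_1nn)
```

In this setup the mean predictor is systematically wrong at the query point but barely changes between training sets, while the nearest-neighbour predictor is nearly unbiased but inherits the full label noise.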

18.1.2 Function complexity and amount of 
training data 

The second issue is the amount of training data available relative to the complexity of the “true” function (classifier or regression function). If the true function is simple, then an “inflexible” learning algorithm with high bias and low variance will be able to learn it from a small amount of data. But if the true function is highly complex (e.g., because it involves complex interactions among many different input features and behaves differently in different parts of the input space), then the function will only be learnable from a very large amount of training data and using a “flexible” learning algorithm with low bias and high variance. Good learning algorithms therefore automatically adjust the bias/variance tradeoff based on the amount of data available and the apparent complexity of the function to be learned.

18.1.3 Dimensionality of the input space 

A third issue is the dimensionality of the input space. If the input feature vectors have very high dimension, the learning problem can be difficult even if the true function only depends on a small number of those features. This is because the many “extra” dimensions can confuse the learning algorithm and cause it to have high variance. Hence, high input dimensionality typically requires tuning the classifier to have low variance and high bias. In practice, if the engineer can manually remove irrelevant features from the input data, this is likely to improve the accuracy of the learned function. In addition, there are many algorithms for feature selection that seek to identify the relevant features and discard the irrelevant ones. This is an instance of the more general strategy of dimensionality reduction, which seeks to map the input data into a lower-dimensional space prior to running the supervised learning algorithm.

18.1.4 Noise in the output values 

A fourth issue is the degree of noise in the desired output values (the supervisory target variables). If the desired output values are often incorrect (because of human error or sensor errors), then the learning algorithm should not attempt to find a function that exactly matches the training examples. Attempting to fit the data too carefully leads to overfitting. Overfitting can occur even when there are no measurement errors (stochastic noise) if the function to be learned is too complex for the learning model. In such a situation, the part of the target function that cannot be modeled “corrupts” the training data; this phenomenon has been called deterministic noise. When either type of noise is present, it is better to go with a higher bias, lower variance estimator.

In practice, there are several approaches to alleviate noise in the output values, such as early stopping to prevent overfitting, as well as detecting and removing the noisy training examples prior to training the supervised learning algorithm. There are several algorithms that identify noisy training examples, and removing the suspected noisy training examples prior to training has decreased generalization error with statistical significance.[4][5]

18.1.5 Other factors to consider 

Other factors to consider when choosing and applying a 
learning algorithm include the following: 

1. Heterogeneity of the data. If the feature vectors include features of many different kinds (discrete, discrete ordered, counts, continuous values), some algorithms are easier to apply than others. Many algorithms, including Support Vector Machines, linear regression, logistic regression, neural networks, and nearest neighbor methods, require that the input features be numerical and scaled to similar ranges (e.g., to the [−1, 1] interval). Methods that employ a distance function, such as nearest neighbor methods and support vector machines with Gaussian kernels, are particularly sensitive to this. An advantage of decision trees is that they easily handle heterogeneous data.

2. Redundancy in the data. If the input features contain redundant information (e.g., highly correlated features), some learning algorithms (e.g., linear regression, logistic regression, and distance based methods) will perform poorly because of numerical instabilities. These problems can often be solved by imposing some form of regularization.

3. Presence of interactions and non-linearities. If each of the features makes an independent contribution to the output, then algorithms based on linear functions (e.g., linear regression, logistic regression, Support Vector Machines, naive Bayes) and distance functions (e.g., nearest neighbor methods, support vector machines with Gaussian kernels) generally perform well. However, if there are complex interactions among features, then algorithms such as decision trees and neural networks work better, because they are specifically designed to discover these interactions. Linear methods can also be applied, but the engineer must manually specify the interactions when using them.
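The scaling requirement mentioned in item 1 is simple to implement. The following is a minimal sketch, assuming min-max scaling of each column to [−1, 1]; the function name `scale_features` and the sample data are hypothetical, and other schemes (such as standardizing to zero mean and unit variance) are equally common.

```python
def scale_features(rows, low=-1.0, high=1.0):
    """Linearly rescale every column of `rows` to the interval [low, high]."""
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column carries no information; map it to the midpoint.
            low + (high - low) * (v - lo) / (hi - lo) if hi > lo else (low + high) / 2
            for v, lo, hi in zip(row, mins, maxs)
        ])
    return scaled


raw = [[0.0, 100.0], [5.0, 200.0], [10.0, 300.0]]  # columns on very different scales
scaled = scale_features(raw)
print(scaled)
```

In practice the minima and maxima should be computed on the training set only and then reused for validation and test data.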
When considering a new application, the engineer can compare multiple learning algorithms and experimentally determine which one works best on the problem at hand (see cross validation). Tuning the performance of a learning algorithm can be very time-consuming. Given fixed resources, it is often better to spend more time collecting additional training data and more informative features than it is to spend extra time tuning the learning algorithms.

The most widely used learning algorithms are Support 
Vector Machines, linear regression, logistic regression, 
naive Bayes, linear discriminant analysis, decision trees, 
k-nearest neighbor algorithm, and Neural Networks 
(Multilayer perceptron). 


18.2 How supervised learning algorithms work

Given a set of $N$ training examples of the form $\{(x_1, y_1), \ldots, (x_N, y_N)\}$ such that $x_i$ is the feature vector of the $i$-th example and $y_i$ is its label (i.e., class), a learning algorithm seeks a function $g : X \to Y$, where $X$ is the input space and $Y$ is the output space. The function $g$ is an element of some space of possible functions $G$, usually called the hypothesis space. It is sometimes convenient to represent $g$ using a scoring function $f : X \times Y \to \mathbb{R}$ such that $g$ is defined as returning the $y$ value that gives the highest score: $g(x) = \arg\max_y f(x, y)$. Let $F$ denote the space of scoring functions.

Although $G$ and $F$ can be any space of functions, many learning algorithms are probabilistic models where $g$ takes the form of a conditional probability model $g(x) = P(y|x)$, or $f$ takes the form of a joint probability model $f(x, y) = P(x, y)$. For example, naive Bayes and linear discriminant analysis are joint probability models, whereas logistic regression is a conditional probability model.

There are two basic approaches to choosing $f$ or $g$: empirical risk minimization and structural risk minimization.[6] Empirical risk minimization seeks the function that best fits the training data. Structural risk minimization includes a penalty function that controls the bias/variance tradeoff.

In both cases, it is assumed that the training set consists of a sample of independent and identically distributed pairs $(x_i, y_i)$. In order to measure how well a function fits the training data, a loss function $L : Y \times Y \to \mathbb{R}^{\ge 0}$ is defined. For training example $(x_i, y_i)$, the loss of predicting the value $\hat{y}$ is $L(y_i, \hat{y})$.

The risk $R(g)$ of function $g$ is defined as the expected loss of $g$. This can be estimated from the training data as

$$R_{emp}(g) = \frac{1}{N} \sum_i L(y_i, g(x_i)).$$
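The empirical risk is just an average of per-example losses, which a few lines of Python make concrete. The names `empirical_risk`, `zero_one_loss`, and `threshold_classifier`, as well as the four made-up training pairs, are assumptions introduced for this sketch.

```python
def empirical_risk(g, examples, loss):
    """Average loss of predictor g over the training examples (x_i, y_i)."""
    return sum(loss(y, g(x)) for x, y in examples) / len(examples)


def zero_one_loss(y, y_hat):
    """1 for a misclassification, 0 otherwise."""
    return 0.0 if y == y_hat else 1.0


def squared_loss(y, y_hat):
    """Usual loss for regression targets."""
    return (y - y_hat) ** 2


def threshold_classifier(x):
    """Hypothetical candidate function g."""
    return 1 if x > 2.5 else 0


examples = [(1.0, 1), (2.0, 0), (3.0, 1), (4.0, 1)]  # made-up (x, y) pairs

risk_01 = empirical_risk(threshold_classifier, examples, zero_one_loss)
print(risk_01)  # one of the four examples is misclassified
```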

18.2.1 Empirical risk minimization 

Main article: Empirical risk minimization 

In empirical risk minimization, the supervised learning algorithm seeks the function $g$ that minimizes the empirical risk $R_{emp}(g)$. Hence, a supervised learning algorithm can be constructed by applying an optimization algorithm to find $g$.

When $g$ is a conditional probability distribution $P(y|x)$ and the loss function is the negative log likelihood, $L(y, \hat{y}) = -\log P(y|x)$, then empirical risk minimization is equivalent to maximum likelihood estimation.

When $G$ contains many candidate functions or the training set is not sufficiently large, empirical risk minimization leads to high variance and poor generalization. The learning algorithm is able to memorize the training examples without generalizing well. This is called overfitting.

18.2.2 Structural risk minimization 

Structural risk minimization seeks to prevent overfitting by incorporating a regularization penalty into the optimization. The regularization penalty can be viewed as implementing a form of Occam’s razor that prefers simpler functions over more complex ones.

A wide variety of penalties have been employed that correspond to different definitions of complexity. For example, consider the case where the function $g$ is a linear function of the form

$$g(x) = \sum_{j=1}^{d} \beta_j x_j.$$

A popular regularization penalty is $\sum_j \beta_j^2$, which is the squared Euclidean norm of the weights, also known as the $L_2$ norm. Other norms include the $L_1$ norm, $\sum_j |\beta_j|$, and the $L_0$ norm, which is the number of non-zero $\beta_j$s. The penalty will be denoted by $C(g)$.

The supervised learning optimization problem is to find the function $g$ that minimizes

$$J(g) = R_{emp}(g) + \lambda C(g).$$

The parameter $\lambda$ controls the bias-variance tradeoff. When $\lambda = 0$, this gives empirical risk minimization with low bias and high variance. When $\lambda$ is large, the learning algorithm will have high bias and low variance. The value of $\lambda$ can be chosen empirically via cross validation.

The complexity penalty has a Bayesian interpretation as the negative log prior probability of $g$, $-\log P(g)$, in which case minimizing $J(g)$ corresponds to maximizing the posterior probability of $g$ (taking the loss to be the negative log likelihood).
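The effect of the regularization parameter can be seen in the smallest possible case: a linear function with a single weight, squared loss, and the $L_2$ penalty. Under those assumptions (which are choices made for this sketch, not part of the text), setting the derivative of $J$ to zero gives a closed-form minimizer.

```python
def ridge_slope(xs, ys, lam):
    """Minimise (1/N) * sum (y_i - beta * x_i)^2 + lam * beta^2 over beta.

    Setting the derivative to zero gives
    beta = sum(x_i * y_i) / (sum(x_i^2) + N * lam).
    """
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)


xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # made-up data lying on y = 2x

for lam in (0.0, 1.0, 100.0):
    print(lam, ridge_slope(xs, ys, lam))
```

With the penalty switched off the fit reproduces the slope 2 exactly; as the penalty weight grows, the estimate is pulled toward zero, trading variance for bias.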

18.3 Generative training 

The training methods described above are discriminative training methods, because they seek to find a function $g$ that discriminates well between the different output values (see discriminative model). For the special case where $f(x, y) = P(x, y)$ is a joint probability distribution and the loss function is the negative log likelihood $-\sum_i \log P(x_i, y_i)$, a risk minimization algorithm is said to perform generative training, because $f$ can be regarded as a generative model that explains how the data were generated. Generative training algorithms are often simpler and more computationally efficient than discriminative training algorithms. In some cases, the solution can be computed in closed form as in naive Bayes and linear discriminant analysis.
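A closed-form generative fit can be sketched for one-dimensional inputs. The model below, a class prior times a per-class Gaussian, is an assumption chosen for illustration (it is in the spirit of naive Bayes with a single feature); the function names and the toy data are hypothetical, and each class is assumed to have non-zero variance.

```python
import math


def fit_gaussian_generative(examples):
    """Closed-form maximum-likelihood fit of P(x, y) = P(y) * N(x; mu_y, var_y)."""
    model = {}
    for label in {y for _, y in examples}:
        values = [x for x, y in examples if y == label]
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / len(values)
        model[label] = (len(values) / len(examples), mu, var)
    return model


def log_joint(model, x, label):
    """log P(x, label) under the fitted model."""
    prior, mu, var = model[label]
    return (math.log(prior)
            - 0.5 * math.log(2 * math.pi * var)
            - (x - mu) ** 2 / (2 * var))


def predict(model, x):
    """g(x) = argmax over y of the scoring function f(x, y) = log P(x, y)."""
    return max(model, key=lambda label: log_joint(model, x, label))


examples = [(0.0, "a"), (1.0, "a"), (2.0, "a"),
            (10.0, "b"), (11.0, "b"), (12.0, "b")]
model = fit_gaussian_generative(examples)
print(predict(model, 1.5), predict(model, 10.5))
```

No iterative optimization is needed: the parameters are sample frequencies, means, and variances.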

18.4 Generalizations of supervised 
learning 

There are several ways in which the standard supervised 
learning problem can be generalized: 

1. Semi-supervised learning: In this setting, the desired output values are provided only for a subset of the training data. The remaining data is unlabeled.

2. Active learning: Instead of assuming that all of the training examples are given at the start, active learning algorithms interactively collect new examples, typically by making queries to a human user. Often, the queries are based on unlabeled data, which is a scenario that combines semi-supervised learning with active learning.

3. Structured prediction: When the desired output value is a complex object, such as a parse tree or a labeled graph, then standard methods must be extended.

4. Learning to rank: When the input is a set of objects and the desired output is a ranking of those objects, then again the standard methods must be extended.


18.5 Approaches and algorithms 

• Analytical learning

• Artificial neural network

• Backpropagation

• Boosting (meta-algorithm)

• Bayesian statistics

• Case-based reasoning

• Decision tree learning

• Inductive logic programming

• Gaussian process regression

• Group method of data handling

• Kernel estimators

• Learning Automata

• Minimum message length (decision trees, decision graphs, etc.)

• Multilinear subspace learning

• Naive Bayes classifier

• Nearest Neighbor Algorithm

• Probably approximately correct (PAC) learning

• Ripple down rules, a knowledge acquisition methodology

• Symbolic machine learning algorithms

• Subsymbolic machine learning algorithms

• Support vector machines

• Random Forests

• Ensembles of Classifiers

• Ordinal classification

• Data Pre-processing

• Handling imbalanced datasets

• Statistical relational learning

• Proaftn, a multicriteria classification algorithm
18.6 Applications

• Bioinformatics

• Cheminformatics

• Quantitative structure-activity relationship

• Database marketing

• Handwriting recognition

• Information retrieval

• Learning to rank

• Object recognition in computer vision

• Optical character recognition

• Spam detection

• Pattern recognition

• Speech recognition

18.7 General issues

• Computational learning theory

• Inductive bias

• Overfitting (machine learning)

• (Uncalibrated) Class membership probabilities

• Version spaces

18.9 External links

• mloss.org: a directory of open source machine learning software.
Chapter 19 

Linear regression 


In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.[1] (This term should be distinguished from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.)[2]

In linear regression, data are modeled using linear predictor functions, and unknown model parameters are estimated from the data. Such models are called linear models.[3] Most commonly, linear regression refers to a model in which the conditional mean of y given the value of X is an affine function of X. Less commonly, linear regression could refer to a model in which the median, or some other quantile of the conditional distribution of y given X, is expressed as a linear function of X. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of y given X, rather than on the joint probability distribution of y and X, which is the domain of multivariate analysis.

Linear regression was the first type of regression analysis to be studied rigorously, and to be used extensively in practical applications.[4] This is because models which depend linearly on their unknown parameters are easier to fit than models which are non-linearly related to their parameters, and because the statistical properties of the resulting estimators are easier to determine.

Linear regression has many practical uses. Most applications fall into one of the following two broad categories:

• If the goal is prediction, or forecasting, or reduction, linear regression can be used to fit a predictive model to an observed data set of y and X values. After developing such a model, if an additional value of X is then given without its accompanying value of y, the fitted model can be used to make a prediction of the value of y.

• Given a variable y and a number of variables $X_1, \ldots, X_p$ that may be related to y, linear regression analysis can be applied to quantify the strength of the relationship between y and the $X_j$, to assess which $X_j$ may have no relationship with y at all, and to identify which subsets of the $X_j$ contain redundant information about y.

Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the “lack of fit” in some other norm (as with least absolute deviations regression), or by minimizing a penalized version of the least squares loss function as in ridge regression (L2-norm penalty) and lasso (L1-norm penalty). Conversely, the least squares approach can be used to fit models that are not linear models. Thus, although the terms “least squares” and “linear model” are closely linked, they are not synonymous.

19.1 Introduction to linear regression



Example of simple linear regression, which has one independent 
variable 

Given a data set $\{y_i, x_{i1}, \ldots, x_{ip}\}_{i=1}^{n}$ of $n$ statistical units, a linear regression model assumes that the relationship between the dependent variable $y_i$ and the $p$-vector of regressors $x_i$ is linear. This relationship is modeled through a disturbance term or error variable $\varepsilon_i$, an unobserved random variable that adds noise to the linear relationship between the dependent variable and regressors. Thus the model takes the form

$$y_i = \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i = x_i^{\mathrm{T}} \beta + \varepsilon_i, \qquad i = 1, \ldots, n,$$

where $\mathrm{T}$ denotes the transpose, so that $x_i^{\mathrm{T}} \beta$ is the inner product between vectors $x_i$ and $\beta$.

Example of a cubic polynomial regression, which is a type of linear regression.

Often these $n$ equations are stacked together and written in vector form as

$$y = X\beta + \varepsilon,$$

where

$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} x_1^{\mathrm{T}} \\ x_2^{\mathrm{T}} \\ \vdots \\ x_n^{\mathrm{T}} \end{pmatrix}
= \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ x_{21} & \cdots & x_{2p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$

Some remarks on terminology and general use:

• $y_i$ is called the regressand, endogenous variable, response variable, measured variable, criterion variable, or dependent variable (see dependent and independent variables). The decision as to which variable in a data set is modeled as the dependent variable and which are modeled as the independent variables may be based on a presumption that the value of one of the variables is caused by, or directly influenced by, the other variables. Alternatively, there may be an operational reason to model one of the variables in terms of the others, in which case there need be no presumption of causality.

• $x_{i1}, x_{i2}, \ldots, x_{ip}$ are called regressors, exogenous variables, explanatory variables, covariates, input variables, predictor variables, or independent variables (see dependent and independent variables, but not to be confused with independent random variables). The matrix $X$ is sometimes called the design matrix.

    • Usually a constant is included as one of the regressors. For example we can take $x_{i1} = 1$ for $i = 1, \ldots, n$. The corresponding element of $\beta$ is called the intercept. Many statistical inference procedures for linear models require an intercept to be present, so it is often included even if theoretical considerations suggest that its value should be zero.

    • Sometimes one of the regressors can be a non-linear function of another regressor or of the data, as in polynomial regression and segmented regression. The model remains linear as long as it is linear in the parameter vector $\beta$.

    • The regressors $x_{ij}$ may be viewed either as random variables, which we simply observe, or they can be considered as predetermined fixed values which we can choose. Both interpretations may be appropriate in different cases, and they generally lead to the same estimation procedures; however, different approaches to asymptotic analysis are used in these two situations.

• $\beta$ is a $p$-dimensional parameter vector. Its elements are also called effects, or regression coefficients. Statistical estimation and inference in linear regression focuses on $\beta$. The elements of this parameter vector are interpreted as the partial derivatives of the dependent variable with respect to the various independent variables.

• $\varepsilon_i$ is called the error term, disturbance term, or noise. This variable captures all other factors which influence the dependent variable $y_i$ other than the regressors $x_i$. Specifying the relationship between the error term and the regressors, for example whether they are correlated, is a crucial step in formulating a linear regression model, as it will determine the method to use for estimation.

Example. Consider a situation where a small ball is being tossed up in the air and then we measure its heights of ascent $h_i$ at various moments in time $t_i$. Physics tells us that, ignoring the drag, the relationship can be modeled as

$$h_i = \beta_1 t_i + \beta_2 t_i^2 + \varepsilon_i,$$

where $\beta_1$ determines the initial velocity of the ball, $\beta_2$ is proportional to the standard gravity, and $\varepsilon_i$ is due to measurement errors. Linear regression can be used to estimate the values of $\beta_1$ and $\beta_2$ from the measured data. This model is non-linear in the time variable, but it is linear in the parameters $\beta_1$ and $\beta_2$; if we take regressors $x_i = (x_{i1}, x_{i2}) = (t_i, t_i^2)$, the model takes on the standard form

$$h_i = x_i^{\mathrm{T}} \beta + \varepsilon_i.$$
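To make the ball example concrete, the two coefficients can be estimated by ordinary least squares. The sketch below solves the 2×2 normal equations $(X^{\mathrm{T}} X)\beta = X^{\mathrm{T}} h$ by hand; the function name `fit_ball_model` and the synthetic, noise-free measurements (generated with $\beta_1 = 10$ and $\beta_2 = -4.9$) are assumptions for illustration.

```python
def fit_ball_model(times, heights):
    """Least-squares estimate of (beta1, beta2) in h = beta1*t + beta2*t^2."""
    # Regressors are x_i = (t_i, t_i^2); accumulate entries of X^T X and X^T h.
    s11 = sum(t**2 for t in times)
    s12 = sum(t**3 for t in times)
    s22 = sum(t**4 for t in times)
    b1 = sum(t * h for t, h in zip(times, heights))
    b2 = sum(t**2 * h for t, h in zip(times, heights))
    # Solve the 2x2 system with the explicit inverse.
    det = s11 * s22 - s12 * s12
    beta1 = (s22 * b1 - s12 * b2) / det
    beta2 = (s11 * b2 - s12 * b1) / det
    return beta1, beta2


times = [0.1 * k for k in range(1, 11)]
heights = [10.0 * t - 4.9 * t * t for t in times]  # noise-free synthetic data
beta1, beta2 = fit_ball_model(times, heights)
print(beta1, beta2)
```

Although the fitted curve is a parabola in $t$, the estimation problem is linear because the unknowns $\beta_1$ and $\beta_2$ enter linearly.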

19.1.1 Assumptions 

Standard linear regression models with standard estimation techniques make a number of assumptions about the predictor variables, the response variables and their relationship. Numerous extensions have been developed that allow each of these assumptions to be relaxed (i.e. reduced to a weaker form), and in some cases eliminated entirely. Some methods are general enough that they can relax multiple assumptions at once, and in other cases this can be achieved by combining different extensions. Generally these extensions make the estimation procedure more complex and time-consuming, and may also require more data in order to produce an equally precise model.

The following are the major assumptions made by standard linear regression models with standard estimation techniques (e.g. ordinary least squares):

• Weak exogeneity. This essentially means that the 
predictor variables x can be treated as fixed values, 
rather than random variables. This means, for ex¬ 
ample, that the predictor variables are assumed to be 
error-free—that is, not contaminated with measure¬ 
ment errors. Although this assumption is not realis¬ 
tic in many settings, dropping it leads to significantly 
more difficult errors-in-variables models. 

• Linearity. This means that the mean of the re¬ 
sponse variable is a linear combination of the param¬ 
eters (regression coefficients) and the predictor vari¬ 
ables. Note that this assumption is much less restric¬ 
tive than it may at first seem. Because the predictor 
variables are treated as fixed values (see above), lin¬ 
earity is really only a restriction on the parameters. 
The predictor variables themselves can be arbitrarily 
transformed, and in fact multiple copies of the same 
underlying predictor variable can be added, each one 
transformed differently. This trick is used, for ex¬ 
ample, in polynomial regression, which uses linear 
regression to fit the response variable as an arbitrary 
polynomial function (up to a given rank) of a pre¬ 
dictor variable. This makes linear regression an ex¬ 
tremely powerful inference method. In fact, models 
such as polynomial regression are often “too power¬ 
ful”, in that they tend to overfit the data. As a re¬ 
sult, some kind of regularization must typically be 
used to prevent unreasonable solutions coming out 
of the estimation process. Common examples are
ridge regression and lasso regression. Bayesian linear
regression can also be used, which by its nature
is more or less immune to the problem of overfit¬ 
ting. (In fact, ridge regression and lasso regression 
can both be viewed as special cases of Bayesian lin¬ 
ear regression, with particular types of prior distri¬ 
butions placed on the regression coefficients.) 

• Constant variance (a.k.a. homoscedasticity). This
means that different response variables have
the same variance in their errors, regardless of 
the values of the predictor variables. In prac¬ 
tice this assumption is invalid (i.e. the errors are 
heteroscedastic) if the response variables can vary 
over a wide scale. To check for heterogeneous error
variance, or for a pattern of residuals that violates
the assumption of homoscedasticity (error is equally
variable around the 'best-fitting line' for all points
of x), it is prudent to look for a “fanning effect”
between residual error and predicted values. That is,
there will be a systematic change in the absolute or
squared residuals when plotted against the predicted
outcome. Error will not be evenly distributed across
the regression line. Heteroscedasticity results in
averaging over distinguishable variances around the
points to get a single variance that inaccurately
represents all the variances along the line. In effect,
residuals appear clustered or spread apart on plots
against predicted values for larger and smaller values
along the regression line, and the mean squared error
for the model will be wrong. Typically, for example,
a response variable whose mean is large will have 
a greater variance than one whose mean is small. 
For example, a given person whose income is pre¬ 
dicted to be $100,000 may easily have an actual in¬ 
come of $80,000 or $120,000 (a standard devia¬ 
tion of around $20,000), while another person with 
a predicted income of $10,000 is unlikely to have 
the same $20,000 standard deviation, which would 
imply their actual income would vary anywhere be¬ 
tween -$10,000 and $30,000. (In fact, as this shows, 
in many cases—often the same cases where the as¬ 
sumption of normally distributed errors fails—the 
variance or standard deviation should be predicted 
to be proportional to the mean, rather than con¬ 
stant.) Simple linear regression estimation meth¬ 
ods give less precise parameter estimates and mis¬ 
leading inferential quantities such as standard errors 
when substantial heteroscedasticity is present. How¬ 
ever, various estimation techniques (e.g. weighted 
least squares and heteroscedasticity-consistent stan¬ 
dard errors) can handle heteroscedasticity in a quite 
general way. Bayesian linear regression techniques 
can also be used when the variance is assumed to be 
a function of the mean. It is also possible in some
cases to fix the problem by applying a transformation
to the response variable (e.g. fit the logarithm
of the response variable using a linear regression
model, which implies that the response variable has
a log-normal distribution rather than a normal
distribution).

• Independence of errors. This assumes that the er¬ 
rors of the response variables are uncorrelated with 
each other. (Actual statistical independence is a 
stronger condition than mere lack of correlation and 
is often not needed, although it can be exploited if it 
is known to hold.) Some methods (e.g. generalized 
least squares) are capable of handling correlated 
errors, although they typically require significantly 
more data unless some sort of regularization is used 
to bias the model towards assuming uncorrelated er¬ 
rors. Bayesian linear regression is a general way of 
handling this issue. 

• Lack of multicollinearity in the predictors. For 
standard least squares estimation methods, the design
matrix X must have full column rank p; otherwise,
we have a condition known as multicollinearity
in the predictor variables. This can be triggered 
by having two or more perfectly correlated predic¬ 
tor variables (e.g. if the same predictor variable is 
mistakenly given twice, either without transform¬ 
ing one of the copies or by transforming one of the 
copies linearly). It can also happen if there is too 
little data available compared to the number of pa¬ 
rameters to be estimated (e.g. fewer data points 
than regression coefficients). In the case of
multicollinearity, the parameter vector $\beta$ will be
non-identifiable: it has no unique solution. At most we
will be able to identify some of the parameters, i.e.
narrow down its value to some linear subspace of
$\mathbb{R}^p$. See partial least squares regression. Methods
for fitting linear models with multicollinearity have
been developed;[5][6][7][8] some require additional
assumptions such as “effect sparsity”, meaning that a
large fraction of the effects are exactly zero. Note that
the more computationally expensive iterated algorithms
for parameter estimation, such as those used
in generalized linear models, do not suffer from this
problem; in fact it is quite normal, when handling
categorically valued predictors, to introduce a
separate indicator variable predictor for each possible
category, which inevitably introduces
multicollinearity.
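
The rank condition and the polynomial-regression trick described in the list above can be checked numerically. The following is a minimal sketch, assuming NumPy is available (the text itself names no software); the data and the helper name `design_matrix` are made up for illustration.

```python
import numpy as np

def design_matrix(t, degree):
    """Columns 1, t, t^2, ..., t^degree: nonlinear in t, linear in the coefficients."""
    return np.vander(t, degree + 1, increasing=True)

t = np.linspace(0.0, 1.0, 8)
X = design_matrix(t, 2)                      # 8 x 3 design matrix
rank_ok = np.linalg.matrix_rank(X)           # full column rank expected

# Append a linearly transformed copy of the t column (3*t + 1). It is an exact
# linear combination of the constant and t columns, so the rank cannot grow.
X_bad = np.column_stack([X, 3.0 * t + 1.0])
rank_bad = np.linalg.matrix_rank(X_bad)

print(rank_ok, X.shape[1])         # rank equals the number of columns
print(rank_bad, X_bad.shape[1])    # rank is smaller than the number of columns
```

When the rank is smaller than the number of columns, the normal equations have no unique solution, which is the non-identifiability described above.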

Beyond these assumptions, several other statistical prop¬ 
erties of the data strongly influence the performance of 
different estimation methods: 

• The statistical relationship between the error terms 
and the regressors plays an important role in deter¬ 
mining whether an estimation procedure has desir¬ 
able sampling properties such as being unbiased and 
consistent. 


• The arrangement, or probability distribution, of the
predictor variables x has a major influence on the
precision of estimates of $\beta$. Sampling and design of
experiments are highly developed subfields of statistics
that provide guidance for collecting data in such
a way as to achieve a precise estimate of $\beta$.

19.1.2 Interpretation 




The data sets in Anscombe’s quartet have the same linear
regression line but are themselves very different.

A fitted linear regression model can be used to identify the
relationship between a single predictor variable $x_j$ and the
response variable y when all the other predictor variables
in the model are “held fixed”. Specifically, the interpretation
of $\beta_j$ is the expected change in y for a one-unit
change in $x_j$ when the other covariates are held fixed,
that is, the expected value of the partial derivative of y
with respect to $x_j$. This is sometimes called the unique
effect of $x_j$ on y. In contrast, the marginal effect of $x_j$ on
y can be assessed using a correlation coefficient or simple
linear regression model relating $x_j$ to y; this effect is the
total derivative of y with respect to $x_j$.

Care must be taken when interpreting regression results, 
as some of the regressors may not allow for marginal 
changes (such as dummy variables, or the intercept term), 
while others cannot be held fixed (recall the example from 
the introduction: it would be impossible to “hold $t_i$ fixed”
and at the same time change the value of $t_i^2$).

It is possible that the unique effect can be nearly zero even
when the marginal effect is large. This may imply that
some other covariate captures all the information in $x_j$, so
that once that variable is in the model, there is no
contribution of $x_j$ to the variation in y. Conversely, the unique
effect of $x_j$ can be large while its marginal effect is nearly
zero. This would happen if the other covariates explained
a great deal of the variation of y, but they mainly explain
variation in a way that is complementary to what is
captured by $x_j$. In this case, including the other variables in
the model reduces the part of the variability of y that is
unrelated to $x_j$, thereby strengthening the apparent
relationship with $x_j$.
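
The first situation, a large marginal effect alongside a unique effect near zero, can be reproduced with simulated data. This is only a sketch under made-up assumptions: the variable names, the noise levels and the use of NumPy are illustrative choices, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x2 = rng.normal(size=n)
x1 = x2 + 0.3 * rng.normal(size=n)        # x1 is mostly a noisy copy of x2
y = 3.0 * x2 + 0.5 * rng.normal(size=n)   # y depends on x2 only

# Marginal effect of x1: simple regression of y on x1 (with intercept).
A = np.column_stack([np.ones(n), x1])
marginal = np.linalg.lstsq(A, y, rcond=None)[0][1]

# Unique effect of x1: its coefficient once x2 is also in the model.
B = np.column_stack([np.ones(n), x1, x2])
unique = np.linalg.lstsq(B, y, rcond=None)[0][1]

print(f"marginal effect of x1: {marginal:.2f}")
print(f"unique effect of x1:   {unique:.2f}")
```

Here x2 captures all the information that x1 carries about y, so the coefficient of x1 collapses toward zero once x2 enters the model.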

The meaning of the expression “held fixed” may depend
on how the values of the predictor variables arise. If 
the experimenter directly sets the values of the predic¬ 
tor variables according to a study design, the compar¬ 
isons of interest may literally correspond to comparisons 
among units whose predictor variables have been “held 
fixed” by the experimenter. Alternatively, the expression 
“held fixed” can refer to a selection that takes place in the 
context of data analysis. In this case, we “hold a variable 
fixed” by restricting our attention to the subsets of the data 
that happen to have a common value for the given predic¬ 
tor variable. This is the only interpretation of “held fixed” 
that can be used in an observational study. 

The notion of a “unique effect” is appealing when study¬ 
ing a complex system where multiple interrelated com¬ 
ponents influence the response variable. In some cases, 
it can literally be interpreted as the causal effect of an 
intervention that is linked to the value of a predictor
variable. However, it has been argued that in many cases
multiple regression analysis fails to clarify the relationships
between the predictor variables and the response variable
when the predictors are correlated with each other and are
not assigned following a study design.[9] A commonality
analysis may be helpful in disentangling the shared and
unique impacts of correlated independent variables.[10]

19.2 Extensions 

Numerous extensions of linear regression have been de¬ 
veloped, which allow some or all of the assumptions un¬ 
derlying the basic model to be relaxed. 

19.2.1 Simple and multiple regression 

The very simplest case of a single scalar predictor vari¬ 
able x and a single scalar response variable y is known as 
simple linear regression. The extension to multiple and/or 
vector-valued predictor variables (denoted with a capital 
X) is known as multiple linear regression, also known as
multivariable linear regression. Nearly all real-world
regression models involve multiple predictors, and basic
descriptions of linear regression are often phrased in terms
of the multiple regression model. Note, however, that in
these cases the response variable y is still a scalar.
Another term, multivariate linear regression, refers to cases
where y is a vector, i.e., the same as general linear
regression. The difference between multivariate linear
regression and multivariable linear regression should be
emphasized, as it causes much confusion and misunderstanding
in the literature.

19.2.2 General linear models 

The general linear model considers the situation when the
response variable Y is not a scalar but a vector.
Conditional linearity of $E(y \mid x) = Bx$ is still assumed, with a
matrix B replacing the vector $\beta$ of the classical linear
regression model. Multivariate analogues of OLS and GLS
have been developed. The term “general linear models”
is equivalent to “multivariate linear models”. Note the
difference between “multivariate linear models” and
“multivariable linear models”: the former is the same as
“general linear models” and the latter is the same as
“multiple linear models”.
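
As a rough illustration of a vector-valued response, the sketch below fits two response columns at once. It assumes NumPy and simulated data; observations are stored as rows, so the fitted array `B_hat` (a name chosen here) is the transpose of the matrix B in the formula above.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # n x 3 predictors
B_true = np.array([[1.0, 0.0],
                   [2.0, -1.0],
                   [0.5, 3.0]])                               # 3 x 2 coefficients
Y = X @ B_true + 0.1 * rng.normal(size=(n, 2))               # n x 2 responses

# One least-squares call fits every response column simultaneously.
B_hat = np.linalg.lstsq(X, Y, rcond=None)[0]

# Equivalent: ordinary least squares applied to each response column in turn.
B_cols = np.column_stack(
    [np.linalg.lstsq(X, Y[:, k], rcond=None)[0] for k in range(Y.shape[1])]
)
print(np.round(B_hat, 2))
```

With this plain least-squares criterion the multivariate fit coincides with separate fits per response; the responses only interact once the error covariance is modeled.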

19.2.3 Heteroscedastic models 

Various models have been created that allow for 
heteroscedasticity, i.e. the errors for different response 
variables may have different variances. For example, 
weighted least squares is a method for estimating linear 
regression models when the response variables may have 
different error variances, possibly with correlated errors. 
(See also Weighted linear least squares, and generalized 
least squares.) Heteroscedasticity-consistent standard er¬ 
rors is an improved method for use with uncorrelated but 
potentially heteroscedastic errors. 
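
A minimal sketch of heteroscedasticity-consistent standard errors follows. It uses the simplest variant, usually called HC0 or White's estimator, which is an assumption on our part since the text does not say which variant is meant; the function name `ols_with_hc0` and the simulated data are hypothetical.

```python
import numpy as np

def ols_with_hc0(X, y):
    """OLS coefficients with classical and HC0 (White) standard errors."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # Classical formula: assumes a single common error variance.
    sigma2 = resid @ resid / (n - p)
    se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))
    # HC0 "sandwich": each observation contributes its own squared residual.
    meat = X.T @ (X * resid[:, None] ** 2)
    se_hc0 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta, se_classical, se_hc0

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(1.0, 10.0, size=n)
y = 2.0 + 0.5 * x + 0.3 * x * rng.normal(size=n)   # noise grows with x
X = np.column_stack([np.ones(n), x])
beta, se_c, se_h = ols_with_hc0(X, y)
print(beta, se_c, se_h)
```

The coefficient estimates are the ordinary OLS ones; only the standard errors change.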

19.2.4 Generalized linear models 

Generalized linear models (GLMs) are a framework for 
modeling a response variable y that is bounded or dis¬ 
crete. This is used, for example: 

• when modeling positive quantities (e.g. prices or 
populations) that vary over a large scale—which are 
better described using a skewed distribution such as 
the log-normal distribution or Poisson distribution 
(although GLMs are not used for log-normal data, 
instead the response variable is simply transformed 
using the logarithm function); 

• when modeling categorical data, such as the choice 
of a given candidate in an election (which is bet¬ 
ter described using a Bernoulli distribution/binomial 
distribution for binary choices, or a categorical 
distribution/multinomial distribution for multi-way 
choices), where there are a fixed number of choices 
that cannot be meaningfully ordered; 

• when modeling ordinal data, e.g. ratings on a scale 
from 0 to 5, where the different outcomes can be 
ordered but where the quantity itself may not have 
any absolute meaning (e.g. a rating of 4 may not be 
“twice as good” in any objective sense as a rating of 
2, but simply indicates that it is better than 2 or 3 
but not as good as 5). 

Generalized linear models allow for an arbitrary link
function g that relates the mean of the response variable
to the predictors, i.e. $E(y) = g^{-1}(\beta' x)$, or equivalently
$g(E(y)) = \beta' x$. The link function is
often related to the distribution of the response, and in
particular it typically has the effect of transforming
between the $(-\infty, \infty)$ range of the linear predictor and the
range of the response variable.

Some common examples of GLMs are: 

• Poisson regression for count data. 

• Logistic regression and probit regression for binary 
data. 

• Multinomial logistic regression and multinomial 
probit regression for categorical data. 

• Ordered probit regression for ordinal data. 
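
For concreteness, write $\eta = \beta' x$ for the linear predictor. The block below lists how the mean of the response depends on $\eta$ in three textbook cases. These are the conventional link choices for each model, stated here as an illustration; the text above does not single them out.

```latex
\begin{aligned}
\text{classical linear regression (identity link):}\quad
  & E(y) = \eta, & E(y) &\in (-\infty, \infty) \\
\text{Poisson regression (log link):}\quad
  & E(y) = e^{\eta}, & E(y) &\in (0, \infty) \\
\text{logistic regression (logit link):}\quad
  & E(y) = \frac{1}{1 + e^{-\eta}}, & E(y) &\in (0, 1)
\end{aligned}
```

In each case the unbounded linear predictor is mapped into the range that the mean of the response can actually take.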

Single index models allow some degree of nonlinearity
in the relationship between x and y, while preserving the
central role of the linear predictor $\beta' x$ as in the classical
linear regression model. Under certain conditions, simply
applying OLS to data from a single-index model will
consistently estimate $\beta$ up to a proportionality constant.[11]

19.2.5 Hierarchical linear models 

Hierarchical linear models (or multilevel regression)
organize the data into a hierarchy of regressions, for
example where A is regressed on B, and B is regressed on
C. They are often used where the data have a natural
hierarchical structure such as in educational statistics, where
students are nested in classrooms, classrooms are nested 
in schools, and schools are nested in some administrative 
grouping, such as a school district. The response variable 
might be a measure of student achievement such as a test 
score, and different covariates would be collected at the 
classroom, school, and school district levels. 

19.2.6 Errors-in-variables 

Errors-in-variables models (or “measurement error
models”) extend the traditional linear regression model to
allow the predictor variables X to be observed with error.
This error causes standard estimators of $\beta$ to become
biased. Generally, the form of bias is an attenuation,
meaning that the effects are biased toward zero.
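
The attenuation can be seen in a small simulation. This sketch uses only the Python standard library; the true slope of 2, the noise levels and the helper name `slope` are arbitrary choices made for illustration.

```python
import random

def slope(xs, ys):
    """OLS slope of ys on xs (simple linear regression with intercept)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

random.seed(0)
n = 20000
x_true = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [2.0 * x + random.gauss(0.0, 0.5) for x in x_true]
# The analyst only observes x with measurement error of the same variance as x.
x_obs = [x + random.gauss(0.0, 1.0) for x in x_true]

slope_clean = slope(x_true, y)   # close to the true value 2
slope_noisy = slope(x_obs, y)    # pulled toward zero, roughly 2 * 1 / (1 + 1)
print(round(slope_clean, 2), round(slope_noisy, 2))
```

With equal variances for the predictor and its measurement error, the slope is expected to shrink by about one half.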

19.2.7 Others 

• In Dempster-Shafer theory, or a linear belief func¬ 
tion in particular, a linear regression model may be 
represented as a partially swept matrix, which can 
be combined with similar matrices representing ob¬ 
servations and other assumed normal distributions 
and state equations. The combination of swept or 
unswept matrices provides an alternative method for 
estimating linear regression models. 


19.3 Estimation methods 



Comparison of the Theil-Sen estimator (black) and simple linear 
regression (blue) for a set of points with outliers. 

A large number of procedures have been developed for 
parameter estimation and inference in linear regression. 
These methods differ in computational simplicity of al¬ 
gorithms, presence of a closed-form solution, robustness 
with respect to heavy-tailed distributions, and theoretical 
assumptions needed to validate desirable statistical prop¬ 
erties such as consistency and asymptotic efficiency. 

Some of the more common estimation techniques for lin¬ 
ear regression are summarized below. 

19.3.1 Least-squares estimation and related techniques

• Ordinary least squares (OLS) is the simplest 
and thus most common estimator. It is concep¬ 
tually simple and computationally straightforward. 
OLS estimates are commonly used to analyze both 
experimental and observational data. 

The OLS method minimizes the sum of squared
residuals, and leads to a closed-form expression for
the estimated value of the unknown parameter $\beta$:

$$\hat{\beta} = (X^T X)^{-1} X^T y = \Big( \sum_i x_i x_i^T \Big)^{-1} \Big( \sum_i x_i y_i \Big).$$

The estimator is unbiased and consistent if the errors
have finite variance and are uncorrelated with the
regressors:[12]

$$E[x_i \varepsilon_i] = 0.$$

It is also efficient under the assumption that the
errors have finite variance and are homoscedastic,
meaning that $E[\varepsilon_i^2 \mid x_i]$ does not depend on i. The
condition that the errors are uncorrelated with the
regressors will generally be satisfied in an experiment,
but in the case of observational data, it is difficult
to exclude the possibility of an omitted covariate
z that is related to both the observed covariates
and the response variable. The existence of such a
covariate will generally lead to a correlation between
the regressors and the error term, and hence
to an inconsistent estimator of $\beta$. The condition of
homoscedasticity can fail with either experimental
or observational data. If the goal is either inference
or predictive modeling, the performance of OLS
estimates can be poor if multicollinearity is present,
unless the sample size is large.

In simple linear regression, where there is only one 
regressor (with a constant), the OLS coefficient es¬ 
timates have a simple form that is closely related to 
the correlation coefficient between the covariate and 
the response. 

• Generalized least squares (GLS) is an extension of
the OLS method that allows efficient estimation of
$\beta$ when either heteroscedasticity, or correlations, or
both are present among the error terms of the model,
as long as the form of heteroscedasticity and
correlation is known independently of the data. To handle
heteroscedasticity when the error terms are
uncorrelated with each other, GLS minimizes a weighted
analogue to the sum of squared residuals from OLS
regression, where the weight for the i-th case is
inversely proportional to $\operatorname{var}(\varepsilon_i)$. This special case of
GLS is called “weighted least squares”. The GLS
solution to the estimation problem is

$$\hat{\beta} = (X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1} y,$$

where $\Omega$ is the covariance matrix of the errors. GLS
can be viewed as applying a linear transformation to 
the data so that the assumptions of OLS are met for 
the transformed data. For GLS to be applied, the 
covariance structure of the errors must be known up 
to a multiplicative constant. 

• Percentage least squares focuses on reducing per¬ 
centage errors, which is useful in the field of fore¬ 
casting or time series analysis. It is also useful in 
situations where the dependent variable has a wide 
range without constant variance, as here the larger 
residuals at the upper end of the range would domi¬ 
nate if OLS were used. When the percentage or rel¬ 
ative error is normally distributed, least squares per¬ 
centage regression provides maximum likelihood 
estimates. Percentage regression is linked to a
multiplicative error model, whereas OLS is linked to
models containing an additive error term.[13]


• Iteratively reweighted least squares (IRLS) is 
used when heteroscedasticity, or correlations, or 
both are present among the error terms of the model, 
but where little is known about the covariance
structure of the errors independently of the data.[14] In
the first iteration, OLS, or GLS with a provisional
covariance structure is carried out, and the residuals
are obtained from the fit. Based on the residuals, an
improved estimate of the covariance structure of the
errors can usually be obtained. A subsequent GLS
iteration is then performed using this estimate of the
error structure to define the weights. The process
can be iterated to convergence, but in many cases,
only one iteration is sufficient to achieve an efficient
estimate of $\beta$.[15][16]

• Instrumental variables regression (IV) can be
performed when the regressors are correlated with the
errors. In this case, we need the existence of some
auxiliary instrumental variables $z_i$ such that
$E[z_i \varepsilon_i] = 0$. If Z is the matrix of instruments, then the
estimator can be given in closed form as

$$\hat{\beta} = (X^T Z (Z^T Z)^{-1} Z^T X)^{-1} X^T Z (Z^T Z)^{-1} Z^T y.$$

• Optimal instruments regression is an extension of
classical IV regression to the situation where
$E[\varepsilon_i \mid z_i] = 0$.

• Total least squares (TLS)[17] is an approach to least
squares estimation of the linear regression model 
that treats the covariates and response variable in a 
more geometrically symmetric manner than OLS. It 
is one approach to handling the “errors in variables” 
problem, and is also sometimes used even when the 
covariates are assumed to be error-free. 
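
The closed-form OLS, GLS and IV estimators from the list above translate almost directly into matrix code. The sketch below assumes NumPy and simulated data; the function names `ols`, `gls` and `iv` are chosen here for illustration and do not refer to any particular library.

```python
import numpy as np

def ols(X, y):
    """Solve the normal equations X^T X b = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def gls(X, y, omega):
    """GLS estimate for a known error covariance matrix omega."""
    omega_inv = np.linalg.inv(omega)
    return np.linalg.solve(X.T @ omega_inv @ X, X.T @ omega_inv @ y)

def iv(X, y, Z):
    """IV estimate (X^T Z (Z^T Z)^-1 Z^T X)^-1 X^T Z (Z^T Z)^-1 Z^T y."""
    proj_X = np.linalg.solve(Z.T @ Z, Z.T @ X)   # (Z^T Z)^-1 Z^T X
    proj_y = np.linalg.solve(Z.T @ Z, Z.T @ y)   # (Z^T Z)^-1 Z^T y
    return np.linalg.solve(X.T @ Z @ proj_X, X.T @ Z @ proj_y)

rng = np.random.default_rng(2)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, -2.0])
var = rng.uniform(0.5, 4.0, size=n)            # known, unequal error variances
y = X @ beta_true + rng.normal(size=n) * np.sqrt(var)

b_ols = ols(X, y)
b_gls = gls(X, y, np.diag(var))                # the weighted least squares case
print(b_ols, b_gls)
```

With a diagonal covariance matrix, `gls` reduces to weighted least squares; with `Z = X`, `iv` reduces to OLS. Explicitly inverting an n-by-n matrix is done here only for readability and would be avoided for large data sets.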

19.3.2 Maximum-likelihood estimation and related techniques

• Maximum likelihood estimation can be
performed when the distribution of the error terms is
known to belong to a certain parametric family $f_\theta$
of probability distributions.[18] When $f_\theta$ is a
normal distribution with zero mean and variance $\theta$, the
resulting estimate is identical to the OLS estimate.
GLS estimates are maximum likelihood estimates
when $\varepsilon$ follows a multivariate normal distribution
with a known covariance matrix.

• Ridge regression,[19][20][21] and other forms of
penalized estimation such as Lasso regression,[5]
deliberately introduce bias into the estimation of $\beta$
in order to reduce the variability of the estimate.
The resulting estimators generally have lower mean
squared error than the OLS estimates, particularly
when multicollinearity is present. They are
generally used when the goal is to predict the value of the
response variable y for values of the predictors x that
have not yet been observed. These methods are not
as commonly used when the goal is inference, since
it is difficult to account for the bias.

• Least absolute deviation (LAD) regression is a 
robust estimation technique in that it is less sensi¬ 
tive to the presence of outliers than OLS (but is less 
efficient than OLS when no outliers are present). It 
is equivalent to maximum likelihood estimation
under a Laplace distribution model for $\varepsilon$.[22]

• Adaptive estimation. If we assume that error terms
are independent from the regressors, $\varepsilon_i \perp x_i$, the
optimal estimator is the 2-step MLE, where the first
step is used to non-parametrically estimate the
distribution of the error term.[23]
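
The bias-for-variance trade made by ridge regression can be sketched with its standard closed form, $(X^T X + \lambda I)^{-1} X^T y$. The formula is the usual textbook one rather than something stated above, and the penalty value, the data and the function name `ridge` are illustrative assumptions.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate (X^T X + lam * I)^-1 X^T y; lam = 0 gives OLS.

    For simplicity every coefficient is penalized; in practice an
    intercept column is often left unpenalized.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(3)
n = 40
z = rng.normal(size=n)
# Two nearly collinear predictors.
X = np.column_stack([z, z + 0.01 * rng.normal(size=n)])
y = z + 0.1 * rng.normal(size=n)

b_ols = ridge(X, y, 0.0)     # unstable: the two columns are almost identical
b_ridge = ridge(X, y, 1.0)   # shrunk toward zero, much less variable
print(b_ols, b_ridge)
```

The ridge coefficient vector is never longer than the OLS one, which is the shrinkage that reduces variability when predictors are nearly collinear.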

19.3.3 Other estimation techniques 

• Bayesian linear regression applies the framework 
of Bayesian statistics to linear regression. (See also 
Bayesian multivariate linear regression.) In particular,
the regression coefficients $\beta$ are assumed to be
random variables with a specified prior distribution. 
The prior distribution can bias the solutions for the 
regression coefficients, in a way similar to (but more 
general than) ridge regression or lasso regression. In 
addition, the Bayesian estimation process produces 
not a single point estimate for the “best” values of the 
regression coefficients but an entire posterior distri¬ 
bution, completely describing the uncertainty sur¬ 
rounding the quantity. This can be used to estimate 
the “best” coefficients using the mean, mode, me¬ 
dian, any quantile (see quantile regression), or any 
other function of the posterior distribution. 

• Quantile regression focuses on the conditional 
quantiles of y given X rather than the conditional 
mean of y given X. Linear quantile regression models
a particular conditional quantile, for example the
conditional median, as a linear function $\beta^T x$ of the
predictors.

• Mixed models are widely used to analyze linear 
regression relationships involving dependent data 
when the dependencies have a known structure. 
Common applications of mixed models include 
analysis of data involving repeated measurements, 
such as longitudinal data, or data obtained from clus¬ 
ter sampling. They are generally fit as parametric 
models, using maximum likelihood or Bayesian es¬ 
timation. In the case where the errors are modeled 
as normal random variables, there is a close con¬ 
nection between mixed models and generalized least 
squares.[24] Fixed effects estimation is an alternative
approach to analyzing this type of data. 

• Principal component regression (PCR)[7][8] is
used when the number of predictor variables is
large, or when strong correlations exist among the
predictor variables. This two-stage procedure first 
reduces the predictor variables using principal com¬ 
ponent analysis then uses the reduced variables in 
an OLS regression fit. While it often works well in 
practice, there is no general theoretical reason that 
the most informative linear function of the predictor 
variables should lie among the dominant principal 
components of the multivariate distribution of the 
predictor variables. The partial least squares regres¬ 
sion is the extension of the PCR method which does 
not suffer from the mentioned deficiency. 

• Least-angle regression[6] is an estimation procedure
for linear regression models that was developed
to handle high-dimensional covariate vectors,
potentially with more covariates than observations.

• The Theil-Sen estimator is a simple robust estima¬ 
tion technique that chooses the slope of the fit line 
to be the median of the slopes of the lines through 
pairs of sample points. It has similar statistical ef¬ 
ficiency properties to simple linear regression but is 
much less sensitive to outliers.[25]

• Other robust estimation techniques, including the
α-trimmed mean approach, and L-, M-, S-, and
R-estimators have been introduced.
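
The Theil-Sen slope described in the list above is short enough to write out directly. In this sketch the intercept is taken as the median of y - slope * x, which is one common convention and an assumption here, since the text only defines the slope; the data are made up.

```python
from itertools import combinations
from statistics import median

def theil_sen(xs, ys):
    """Slope = median of pairwise slopes; intercept = median of y - slope * x."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1                      # skip vertical pairs
    ]
    m = median(slopes)
    b = median(y - m * x for x, y in zip(xs, ys))
    return m, b

xs = [0, 1, 2, 3, 4, 5, 6]
ys = [1, 3, 5, 7, 9, 11, 100]   # y = 2x + 1 except for one gross outlier
m, b = theil_sen(xs, ys)
print(m, b)   # 2.0 1.0
```

Fifteen of the 21 pairwise slopes equal 2, so the median ignores the outlier completely, whereas an OLS fit to the same points would be pulled strongly upward.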

19.3.4 Further discussion 

In statistics and numerical analysis, the problem of nu¬ 
merical methods for linear least squares is an impor¬ 
tant one because linear regression models are one of the 
most important types of model, both as formal statistical 
models and for exploration of data sets. The majority of 
statistical computer packages contain facilities for regres¬ 
sion analysis that make use of linear least squares compu¬ 
tations. Hence it is appropriate that considerable effort 
has been devoted to the task of ensuring that these com¬ 
putations are undertaken efficiently and with due regard 
to numerical precision. 

Individual statistical analyses are seldom undertaken in 
isolation, but rather are part of a sequence of investiga¬ 
tory steps. Some of the topics involved in considering 
numerical methods for linear least squares relate to this 
point. Thus important topics can be:

• Computations where a number of similar, and of¬ 
ten nested, models are considered for the same data 
set. That is, where models with the same dependent 
variable but different sets of independent variables 
are to be considered, for essentially the same set of 
data points. 

• Computations for analyses that occur in a sequence, 
as the number of data points increases. 

• Special considerations for very extensive data sets. 

Fitting of linear models by least squares often, but not al¬
ways, arises in the context of statistical analysis. It can 
therefore be important that considerations of computa¬ 
tional efficiency for such problems extend to all of the 
auxiliary quantities required for such analyses, and are not 
restricted to the formal solution of the linear least squares 
problem. 

Matrix calculations, like any others, are affected by 
rounding errors. An early summary of these effects, re¬ 
garding the choice of computational methods for matrix 
inversion, was provided by Wilkinson.[26]
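
One commonly cited reason why the choice of method matters is that forming $X^T X$ squares the condition number of the problem, whereas a QR factorization works on X directly. The sketch below illustrates this general point with NumPy on simulated data; it is not taken from Wilkinson's summary.

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(30), rng.normal(size=30), rng.normal(size=30)])
y = rng.normal(size=30)

# Route 1: normal equations, which work with X^T X.
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Route 2: QR factorization X = QR, then solve the triangular system R b = Q^T y.
Q, R = np.linalg.qr(X)
b_qr = np.linalg.solve(R, Q.T @ y)

# Forming X^T X squares the 2-norm condition number.
cond_X = np.linalg.cond(X)
cond_XtX = np.linalg.cond(X.T @ X)
print(b_normal, b_qr)
print(cond_X ** 2, cond_XtX)
```

On a well-conditioned problem like this one both routes agree; the difference shows up when the columns of X are nearly collinear and rounding errors are amplified.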


19.3.5 Using Linear Algebra

It follows that one can find a “best” approximation of
another function by minimizing the area between two
functions, a continuous function $f$ on $[a, b]$ and a function
$g \in W$, where $W$ is a subspace of $C[a, b]$:

$$\text{Area} = \int_a^b |f(x) - g(x)| \, dx,$$

all within the subspace $W$. Due to the frequent difficulty
of evaluating integrands involving absolute value, one
can instead define

$$\int_a^b [f(x) - g(x)]^2 \, dx$$

as an adequate criterion for obtaining the least squares
approximation, function $g$, of $f$ with respect to the inner
product space $W$.

As such, $\|f - g\|^2$ or, equivalently, $\|f - g\|$, can thus be
written in vector form:

$$\int_a^b [f(x) - g(x)]^2 \, dx = \langle f - g, f - g \rangle = \|f - g\|^2.$$

In other words, the least squares approximation of $f$ is the
function $g \in$ subspace $W$ closest to $f$ in terms of the
inner product $\langle f, g \rangle$. Furthermore, this can be applied
with a theorem:

Let $f$ be continuous on $[a, b]$, and let $W$ be
a finite-dimensional subspace of $C[a, b]$. The
least squares approximating function of $f$ with
respect to $W$ is given by

$$g = \langle f, w_1 \rangle w_1 + \langle f, w_2 \rangle w_2 + \cdots + \langle f, w_n \rangle w_n,$$

where $B = \{w_1, w_2, \ldots, w_n\}$ is an orthonormal
basis for $W$.


19.4 Applications of linear regression

Linear regression is widely used in biological, behavioral
and social sciences to describe possible relationships
between variables. It ranks as one of the most important
tools used in these disciplines.

19.4.1 Trend line

Main article: Trend estimation

A trend line represents a trend, the long-term movement
in time series data after other components have been
accounted for. It tells whether a particular data set (say
GDP, oil prices or stock prices) has increased or
decreased over the period of time. A trend line could simply
be drawn by eye through a set of data points, but
more properly its position and slope are calculated using
statistical techniques like linear regression. Trend lines
typically are straight lines, although some variations use
higher degree polynomials depending on the degree of
curvature desired in the line.

Trend lines are sometimes used in business analytics to
show changes in data over time. This has the advantage
of being simple. Trend lines are often used to argue that
a particular action or event (such as training, or an
advertising campaign) caused observed changes at a point
in time. This is a simple technique, and does not require
a control group, experimental design, or a sophisticated
analysis technique. However, it suffers from a lack of
scientific validity in cases where other potential changes can
affect the data.

19.4.2 Epidemiology

Early evidence relating tobacco smoking to mortality and 
morbidity came from observational studies employing
regression analysis. In order to reduce spurious correlations
when analyzing observational data, researchers usually
include several variables in their regression models in
addition to the variable of primary interest. For example,
suppose we have a regression model in which cigarette
smoking is the independent variable of interest, and the
dependent variable is lifespan measured in years. Researchers
might include socio-economic status as an additional in¬ 
dependent variable, to ensure that any observed effect of 
smoking on lifespan is not due to some effect of educa¬ 
tion or income. However, it is never possible to include 
all possible confounding variables in an empirical anal¬ 
ysis. For example, a hypothetical gene might increase 
mortality and also cause people to smoke more. For this 
reason, randomized controlled trials are often able to gen¬ 
erate more compelling evidence of causal relationships 
than can be obtained using regression analyses of obser- 
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vational data. When controlled experiments are not fea¬ 
sible, variants of regression analysis such as instrumental 
variables regression may be used to attempt to estimate 
causal relationships from observational data. 
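
The effect of adding a confounding variable to a regression can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data-generating process, the variable names `x`, `y`, `z` and the coefficient values are hypothetical and do not come from any study. A confounder `z` drives both the exposure `x` and the outcome `y`, and the true effect of `x` on `y` is set to 1.0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data-generating process (an assumption for illustration):
# z is a confounder that influences both the exposure x and the outcome y.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = 1.0 * x + 2.0 * z + rng.normal(size=n)  # true effect of x on y is 1.0


def ols(columns, response):
    """Least-squares coefficients of response on an intercept plus columns."""
    design = np.column_stack([np.ones(len(response))] + list(columns))
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef


naive_slope = ols([x], y)[1]        # confounder omitted
adjusted_slope = ols([x, z], y)[1]  # confounder included as a covariate

print(f"naive: {naive_slope:.2f}, adjusted: {adjusted_slope:.2f}")
```

Under these assumptions the regression that omits `z` attributes part of the confounder's effect to `x`, while the regression that includes `z` recovers a slope close to the true value. A confounder that is not measured cannot be included in this way, which is the limitation described above.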

19.4.3 Finance 

The capital asset pricing model uses linear regression as 
well as the concept of beta for analyzing and quantifying 
the systematic risk of an investment. This comes directly 
from the beta coefficient of the linear regression model 
that relates the return on the investment to the return on 
all risky assets. 
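
As a minimal sketch of how beta arises as a regression slope, the snippet below computes the least-squares slope of an asset's returns on the market's returns as covariance divided by variance. The return series are made-up numbers and the helper name `beta_coefficient` is hypothetical; the asset is constructed so that its returns move 1.5 times as much as the market's.

```python
def mean(values):
    return sum(values) / len(values)


def beta_coefficient(asset_returns, market_returns):
    """Slope of the least-squares line of asset returns on market returns."""
    asset_mean = mean(asset_returns)
    market_mean = mean(market_returns)
    covariance = sum(
        (a - asset_mean) * (m - market_mean)
        for a, m in zip(asset_returns, market_returns)
    )
    variance = sum((m - market_mean) ** 2 for m in market_returns)
    return covariance / variance


# Hypothetical per-period returns (assumed values for illustration only).
market = [0.02, -0.01, 0.03, 0.00, -0.02, 0.01]
asset = [0.005 + 1.5 * r for r in market]

beta = beta_coefficient(asset, market)
print(round(beta, 6))
```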

19.4.4 Economics 

Main article: Econometrics 

Linear regression is the predominant empirical tool
in economics. For example, it is used to predict
consumption spending,[27] fixed investment spending,
inventory investment, purchases of a country’s
exports,[28] spending on imports,[28] the demand to hold
liquid assets,[29] labor demand,[30] and labor supply.[30]

19.4.5 Environmental science 

Linear regression finds application in a wide range of
environmental science applications. In Canada, the
Environmental Effects Monitoring Program uses statistical
analyses on fish and benthic surveys to measure the
effects of pulp mill or metal mine effluent on the aquatic
ecosystem.[31]

19.5 See also 

• Analysis of variance 

• Censored regression model 

• Cross-sectional regression 

• Curve fitting 

• Empirical Bayes methods 

• Errors and residuals 

• Lack-of-fit sum of squares 

• Linear classifier 

• Logistic regression 

• M-estimator 

• MLPACK contains a C++ implementation of linear 
regression 


• Multivariate adaptive regression splines 

• Nonlinear regression 

• Nonparametric regression 

• Normal equations 

• Projection pursuit regression 

• Segmented linear regression 

• Stepwise regression 

• Support vector machine 

• Truncated regression model 


19.6 Notes 

[1] David A. Freedman (2009). Statistical Models: Theory 
and Practice. Cambridge University Press, p. 26. A 
simple regression equation has on the right hand side an 
intercept and an explanatory variable with a slope coeffi¬ 
cient. A multiple regression equation has several explana¬ 
tory variables on the right hand side, each with its own 
slope coefficient 

[2] Rencher, Alvin C.; Christensen, William F. (2012), 
“Chapter 10, Multivariate regression - Section 10.1, In¬ 
troduction”, Methods of Multivariate Analysis, Wiley Se¬ 
ries in Probability and Statistics 709 (3rd ed.), John Wiley 
& Sons, p. 19, ISBN 9781118391679. 

[3] Hilary L. Seal (1967). “The historical development of 
the Gauss linear model”. Biometrika 54 (1/2): 1-24. 
doi: 10.1093/biomet/54.1-2.1. 

[4] Yan, Xin (2009), Linear Regression Analysis: The¬ 
ory and Computing, World Scientific, pp. 1-2, ISBN 
9789812834119, Regression analysis ... is probably one 
of the oldest topics in mathematical statistics dating back 
to about two hundred years ago. The earliest form of the 
linear regression was the least squares method, which was 
published by Legendre in 1805. and by Gauss in 1809 ... 
Legendre and Gauss both applied the method to the prob¬ 
lem of determining, from astronomical observations, the 
orbits of bodies about the sun. 
Chapter 20

Tikhonov regularization 


Tikhonov regularization, named for Andrey Tikhonov, 
is the most commonly used method of regularization of 
ill-posed problems. In statistics, the method is known 
as ridge regression, and with multiple independent dis¬ 
coveries, it is also variously known as the Tikhonov- 
Miller method, the Phillips-Twomey method, the con¬ 
strained linear inversion method, and the method of 
linear regularization. It is related to the Levenberg- 
Marquardt algorithm for non-linear least-squares prob¬ 
lems. 

When the following problem is not well posed (either be¬ 
cause of non-existence or non-uniqueness of x ) 


Ax = b, 

then the standard approach (known as ordinary least 
squares) leads to an overdetermined, or more often an 
underdetermined system of equations. Most real-world 
phenomena operate as low-pass filters in the forward di¬ 
rection where A maps x to b . Therefore in solving the 
inverse-problem, the inverse-mapping operates as a high- 
pass filter that has the undesirable tendency of amplifying 
noise (eigenvalues / singular values are largest in the re¬ 
verse mapping where they were smallest in the forward 
mapping). In addition, ordinary least squares implicitly 
nullifies every element of the reconstructed version of x 
that is in the null-space of A , rather than allowing for a 
model to be used as a prior for x . Ordinary least squares 
seeks to minimize the sum of squared residuals, which 
can be compactly written as 

$$\|Ax - b\|^2$$

where $\|\cdot\|$ is the Euclidean norm. In order to give preference
to a particular solution with desirable properties, a
regularization term can be included in this minimization:

$$\|Ax - b\|^2 + \|\Gamma x\|^2$$

for some suitably chosen Tikhonov matrix, $\Gamma$. In many
cases, this matrix is chosen as a multiple of the identity
matrix ($\Gamma = \alpha I$), giving preference to solutions with
smaller norms; this is known as L2 regularization.[1] In
other cases, high-pass operators (e.g., a difference operator
or a weighted Fourier operator) may be used to enforce
smoothness if the underlying vector is believed to
be mostly continuous. This regularization improves the
conditioning of the problem, thus enabling a direct numerical
solution. An explicit solution, denoted by $\hat{x}$, is
given by:

$$\hat{x} = (A^T A + \Gamma^T \Gamma)^{-1} A^T b$$

The effect of regularization may be varied via the scale of
matrix $\Gamma$. For $\Gamma = 0$ this reduces to the unregularized
least squares solution provided that $(A^T A)^{-1}$ exists.

L2 regularization is used in many contexts aside from
linear regression, such as classification with logistic
regression or support vector machines,[2] and matrix
factorization.[3]
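
The explicit solution above can be sketched numerically. This is a minimal illustration, not a reference implementation: the function name `tikhonov_solve`, the matrix `A`, the vector `b` and the value of `alpha` are all assumptions chosen to produce an ill-conditioned problem with two nearly collinear columns.

```python
import numpy as np


def tikhonov_solve(A, b, Gamma):
    """Solve (A^T A + Gamma^T Gamma) x = A^T b for x."""
    return np.linalg.solve(A.T @ A + Gamma.T @ Gamma, A.T @ b)


# Hypothetical ill-conditioned problem: the two columns are nearly identical.
A = np.array([[1.0, 1.0],
              [1.0, 1.0001],
              [1.0, 0.9999]])
b = np.array([2.0, 2.1, 1.9])

alpha = 0.1
x_reg = tikhonov_solve(A, b, alpha * np.eye(2))   # Gamma = alpha * I
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)      # unregularized least squares

print("regularized:  ", x_reg, "norm", np.linalg.norm(x_reg))
print("least squares:", x_ls, "norm", np.linalg.norm(x_ls))
```

With these inputs the unregularized solution has very large components of opposite sign, while the regularized solution has a much smaller norm, which is the preference described above. Solving the linear system directly is used here instead of forming the inverse explicitly; that is a numerical design choice, not something the text prescribes.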


20.1 History 

Tikhonov regularization has been invented independently 
in many different contexts. It became widely known from 
its application to integral equations from the work of 
Andrey Tikhonov and David L. Phillips. Some authors 
use the term Tikhonov-Phillips regularization. The 
finite-dimensional case was expounded by Arthur E. Hoerl,
who took a statistical approach, and by Manus Foster,
who interpreted this method as a Wiener-Kolmogorov filter.
Following Hoerl, it is known in the statistical literature
as ridge regression.

20.2 Generalized Tikhonov regularization

For general multivariate normal distributions for $x$ and
the data error, one can apply a transformation of the variables
to reduce to the case above. Equivalently, one can
seek an $x$ to minimize

$$\|Ax - b\|_P^2 + \|x - x_0\|_Q^2$$

where we have used $\|x\|_Q^2$ to stand for the weighted norm
$x^T Q x$ (compare with the Mahalanobis distance). In the
Bayesian interpretation $P$ is the inverse covariance matrix
of $b$, $x_0$ is the expected value of $x$, and $Q$ is the inverse
covariance matrix of $x$. The Tikhonov matrix is then
given as a factorization of the matrix $Q = \Gamma^T \Gamma$ (e.g.
the Cholesky factorization), and is considered a whitening
filter.

This generalized problem has an optimal solution $x^*$
which can be solved explicitly using the formula

$$x^* = (A^T P A + Q)^{-1} (A^T P b + Q x_0)$$

or equivalently

$$x^* = x_0 + (A^T P A + Q)^{-1} (A^T P (b - A x_0)).$$

20.3 Regularization in Hilbert space

Typically discrete linear ill-conditioned problems result
from discretization of integral equations, and one can formulate
a Tikhonov regularization in the original infinite-dimensional
context. In the above we can interpret $A$ as a
compact operator on Hilbert spaces, and $x$ and $b$ as elements
in the domain and range of $A$. The operator
$A^* A + \Gamma^T \Gamma$ is then a self-adjoint bounded invertible
operator.

20.4 Relation to singular value decomposition and Wiener filter

With $\Gamma = \alpha I$, this least squares solution can be analyzed
in a special way via the singular value decomposition.
Given the singular value decomposition of $A$

$$A = U \Sigma V^T$$

with singular values $\sigma_i$, the Tikhonov regularized solution
can be expressed as

$$\hat{x} = V D U^T b$$

where $D$ has diagonal values

$$D_{ii} = \frac{\sigma_i}{\sigma_i^2 + \alpha^2}$$

and is zero elsewhere. This demonstrates the effect of
the Tikhonov parameter on the condition number of the
regularized problem. For the generalized case a similar
representation can be derived using a generalized singular
value decomposition.

Finally, it is related to the Wiener filter:

$$\hat{x} = \sum_{i=1}^{q} f_i \frac{u_i^T b}{\sigma_i} v_i$$

where the Wiener weights are $f_i = \frac{\sigma_i^2}{\sigma_i^2 + \alpha^2}$,
$u_i$ and $v_i$ are the columns of $U$ and $V$, and $q$ is the
rank of $A$.

20.5 Determination of the Tikhonov factor

The optimal regularization parameter $\alpha$ is usually unknown
and often in practical problems is determined
by an ad hoc method. A possible approach relies
on the Bayesian interpretation described below. Other
approaches include the discrepancy principle, cross-validation,
L-curve method, restricted maximum likelihood
and unbiased predictive risk estimator. Grace
Wahba proved that the optimal parameter, in the sense
of leave-one-out cross-validation minimizes:

$$G = \frac{\mathrm{RSS}}{\tau^2} = \frac{\|X \hat{\beta} - y\|^2}{[\operatorname{Tr}(I - X (X^T X + \alpha^2 I)^{-1} X^T)]^2}$$

where RSS is the residual sum of squares and $\tau$ is the
effective number of degrees of freedom.

Using the previous SVD decomposition (with $X$ in the role
of $A$), we can simplify the above expression. The effective
number of degrees of freedom becomes

$$\tau = m - \sum_{i=1}^{q} \frac{\sigma_i^2}{\sigma_i^2 + \alpha^2}$$

or

$$\tau = m - q + \sum_{i=1}^{q} \frac{\alpha^2}{\sigma_i^2 + \alpha^2},$$

where $m$ is the number of rows of $X$.

20.6 Relation to probabilistic formulation

The probabilistic formulation of an inverse problem introduces
(when all uncertainties are Gaussian) a covariance
matrix $C_M$ representing the a priori uncertainties
on the model parameters, and a covariance matrix $C_D$
representing the uncertainties on the observed parameters
(see, for instance, Tarantola, 2005). In the special
case when these two matrices are diagonal and isotropic,
$C_M = \sigma_M^2 I$ and $C_D = \sigma_D^2 I$, and, in this case, the
equations of inverse theory reduce to the equations above, with
$\alpha = \sigma_D / \sigma_M$.
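
The singular value form of the regularized solution (section 20.4) and the effective number of degrees of freedom (section 20.5) can be checked numerically against the direct formulas. The sketch below assumes $\Gamma = \alpha I$ and uses a random test matrix; the sizes, the seed and the value of `alpha` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))   # hypothetical forward operator
b = rng.normal(size=8)        # hypothetical data
alpha = 0.5

# Solution via the SVD: x = V D U^T b with D_ii = sigma_i / (sigma_i^2 + alpha^2).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
d = s / (s**2 + alpha**2)
x_svd = Vt.T @ (d * (U.T @ b))

# Solution via the explicit formula (A^T A + alpha^2 I)^{-1} A^T b.
regularized_gram = A.T @ A + alpha**2 * np.eye(A.shape[1])
x_direct = np.linalg.solve(regularized_gram, A.T @ b)

# Effective degrees of freedom: trace form versus singular value form.
m = A.shape[0]
influence = A @ np.linalg.solve(regularized_gram, A.T)
tau_trace = np.trace(np.eye(m) - influence)
tau_svd = m - np.sum(s**2 / (s**2 + alpha**2))

print(np.allclose(x_svd, x_direct), np.isclose(tau_trace, tau_svd))
```

The factors `s**2 / (s**2 + alpha**2)` are the Wiener weights: directions with singular values well above `alpha` are left almost unchanged, and directions with singular values well below `alpha` are suppressed.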

20.7 Bayesian interpretation 

Further information: Minimum mean square error § 
Linear MMSE estimator for linear observation process 

Although at first the choice of the solution to this regularized
problem may look artificial, and indeed the matrix $\Gamma$
seems rather arbitrary, the process can be justified from
a Bayesian point of view. Note that for an ill-posed problem
one must necessarily introduce some additional assumptions
in order to get a unique solution. Statistically,
the prior probability distribution of $x$ is sometimes taken
to be a multivariate normal distribution. For simplicity
here, the following assumptions are made: the means are
zero; their components are independent; the components
have the same standard deviation $\sigma_x$. The data are also
subject to errors, and the errors in $b$ are also assumed to
be independent with zero mean and standard deviation $\sigma_b$.
Under these assumptions the Tikhonov-regularized solution
is the most probable solution given the data and the
a priori distribution of $x$, according to Bayes’ theorem.[4]

If the assumption of normality is replaced by assumptions 
of homoskedasticity and uncorrelatedness of errors, and 
if one still assumes zero mean, then the Gauss-Markov 
theorem entails that the solution is the minimal unbiased 
estimator. 


20.8 See also 

• LASSO estimator is another regularization method 
in statistics. 
Chapter 21

Regression analysis 


In statistics, regression analysis is a statistical process 
for estimating the relationships among variables. It in¬ 
cludes many techniques for modeling and analyzing sev¬ 
eral variables, when the focus is on the relationship be¬ 
tween a dependent variable and one or more independent 
variables (or 'predictors’). More specifically, regression 
analysis helps one understand how the typical value of the 
dependent variable (or 'criterion variable') changes when 
any one of the independent variables is varied, while the 
other independent variables are held fixed. Most com¬ 
monly, regression analysis estimates the conditional ex¬ 
pectation of the dependent variable given the indepen¬ 
dent variables - that is, the average value of the dependent 
variable when the independent variables are fixed. Less 
commonly, the focus is on a quantile, or other location pa¬ 
rameter of the conditional distribution of the dependent 
variable given the independent variables. In all cases, the 
estimation target is a function of the independent vari¬ 
ables called the regression function. In regression analy¬ 
sis, it is also of interest to characterize the variation of the 
dependent variable around the regression function which 
can be described by a probability distribution. 

Regression analysis is widely used for prediction and 
forecasting, where its use has substantial overlap with the 
field of machine learning. Regression analysis is also used 
to understand which among the independent variables 
are related to the dependent variable, and to explore the 
forms of these relationships. In restricted circumstances, 
regression analysis can be used to infer causal relation¬ 
ships between the independent and dependent variables. 
However this can lead to illusions or false relationships,
so caution is advisable;[1] for example, correlation does
not imply causation.

Many techniques for carrying out regression analysis have 
been developed. Familiar methods such as linear regres¬ 
sion and ordinary least squares regression are parametric, 
in that the regression function is defined in terms of a fi¬ 
nite number of unknown parameters that are estimated 
from the data. Nonparametric regression refers to tech¬ 
niques that allow the regression function to lie in a speci¬ 
fied set of functions, which may be infinite-dimensional. 

The performance of regression analysis methods in practice
depends on the form of the data generating process,
and how it relates to the regression approach being
used. Since the true form of the data-generating process
is generally not known, regression analysis often depends
to some extent on making assumptions about this
process. These assumptions are sometimes testable if a
sufficient quantity of data is available. Regression models
for prediction are often useful even when the assumptions
are moderately violated, although they may not perform
optimally. However, in many applications, especially
with small effects or questions of causality based
on observational data, regression methods can give misleading
results.[2][3]


21.1 History 

The earliest form of regression was the method of least
squares, which was published by Legendre in 1805,[4] and
by Gauss in 1809.[5] Legendre and Gauss both applied
the method to the problem of determining, from astronomical
observations, the orbits of bodies about the Sun
(mostly comets, but also later the then newly discovered
minor planets). Gauss published a further development
of the theory of least squares in 1821,[6] including a version
of the Gauss-Markov theorem.

The term “regression” was coined by Francis Galton
in the nineteenth century to describe a biological phenomenon.
The phenomenon was that the heights of descendants
of tall ancestors tend to regress down towards a
normal average (a phenomenon also known as regression
toward the mean).[7][8] For Galton, regression had only
this biological meaning,[9][10] but his work was later extended
by Udny Yule and Karl Pearson to a more general
statistical context.[11][12] In the work of Yule and Pearson,
the joint distribution of the response and explanatory
variables is assumed to be Gaussian. This assumption
was weakened by R.A. Fisher in his works of 1922
and 1925.[13][14][15] Fisher assumed that the conditional
distribution of the response variable is Gaussian, but the
joint distribution need not be. In this respect, Fisher’s
assumption is closer to Gauss’s formulation of 1821.

In the 1950s and 1960s, economists used electromechanical
desk calculators to calculate regressions. Before 1970,
it sometimes took up to 24 hours to receive the result from
one regression.[16]

Regression methods continue to be an area of active re¬ 
search. In recent decades, new methods have been de¬ 
veloped for robust regression, regression involving cor¬ 
related responses such as time series and growth curves, 
regression in which the predictor (independent variable) 
or response variables are curves, images, graphs, or other 
complex data objects, regression methods accommodat¬ 
ing various types of missing data, nonparametric re¬ 
gression, Bayesian methods for regression, regression in 
which the predictor variables are measured with error, re¬ 
gression with more predictor variables than observations, 
and causal inference with regression. 

21.2 Regression models 

Regression models involve the following variables: 

• The unknown parameters, denoted as $\beta$, which
may represent a scalar or a vector.

• The independent variables, $X$.

• The dependent variable, $Y$.

In various fields of application, different terminologies are
used in place of dependent and independent variables.

A regression model relates $Y$ to a function of $X$ and $\beta$.

$$Y \approx f(X, \beta)$$

The approximation is usually formalized as $E(Y \mid X) =
f(X, \beta)$. To carry out regression analysis, the form of
the function $f$ must be specified. Sometimes the form of
this function is based on knowledge about the relationship
between $Y$ and $X$ that does not rely on the data. If no such
knowledge is available, a flexible or convenient form for
$f$ is chosen.

Assume now that the vector of unknown parameters $\beta$
is of length $k$. In order to perform a regression analysis
the user must provide information about the dependent
variable $Y$:

• If $N$ data points of the form $(Y, X)$ are observed,
where $N < k$, most classical approaches to regression
analysis cannot be performed: since the system
of equations defining the regression model is
underdetermined, there are not enough data to recover $\beta$.

• If exactly $N = k$ data points are observed, and the
function $f$ is linear, the equations $Y = f(X, \beta)$ can
be solved exactly rather than approximately. This
reduces to solving a set of $N$ equations with $N$ unknowns
(the elements of $\beta$), which has a unique solution
as long as the $X$ are linearly independent. If $f$
is nonlinear, a solution may not exist, or many solutions
may exist.

• The most common situation is where $N > k$ data
points are observed. In this case, there is enough
information in the data to estimate a unique value
for $\beta$ that best fits the data in some sense, and the
regression model when applied to the data can be
viewed as an overdetermined system in $\beta$.

In the last case, the regression analysis provides the tools
for:

1. Finding a solution for unknown parameters $\beta$ that
will, for example, minimize the distance between
the measured and predicted values of the dependent
variable $Y$ (also known as method of least squares).

2. Under certain statistical assumptions, the regression
analysis uses the surplus of information to provide
statistical information about the unknown parameters
$\beta$ and predicted values of the dependent variable $Y$.


21.2.1 Necessary number of independent 
measurements 

Consider a regression model which has three unknown
parameters, $\beta_0$, $\beta_1$, and $\beta_2$. Suppose an experimenter
performs 10 measurements all at exactly the same value
of independent variable vector $X$ (which contains the
independent variables $X_1$, $X_2$, and $X_3$). In this case, regression
analysis fails to give a unique set of estimated values
for the three unknown parameters; the experimenter did
not provide enough information. The best one can do is
to estimate the average value and the standard deviation
of the dependent variable $Y$. Similarly, measuring at two
different values of $X$ would give enough data for a regression
with two unknowns, but not for three or more
unknowns.

If the experimenter had performed measurements at
three different values of the independent variable vector
$X$, then regression analysis would provide a unique set of
estimates for the three unknown parameters in $\beta$.

In the case of general linear regression, the above statement
is equivalent to the requirement that the matrix $X^T X$
is invertible.
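
The invertibility requirement can be illustrated by checking the rank of $X^T X$ for different sets of measurement points. The sketch below assumes a hypothetical model with an intercept and two independent variables (three unknown parameters); the measurement points are made up, and the three distinct points are chosen not to lie on a common line.

```python
import numpy as np


def gram_rank(points):
    """Rank of X^T X for the design matrix with rows [1, x1, x2]."""
    points = np.asarray(points, dtype=float)
    X = np.column_stack([np.ones(len(points)), points])
    return np.linalg.matrix_rank(X.T @ X)


# Ten measurements in each case; only the number of distinct points differs.
same_point = [(1.0, 2.0)] * 10
two_points = [(1.0, 2.0)] * 5 + [(3.0, 1.0)] * 5
three_points = [(1.0, 2.0)] * 4 + [(3.0, 1.0)] * 3 + [(2.0, 5.0)] * 3

rank_same = gram_rank(same_point)
rank_two = gram_rank(two_points)
rank_three = gram_rank(three_points)
print(rank_same, rank_two, rank_three)
```

Only when the rank equals the number of unknown parameters is $X^T X$ invertible, so repeating a measurement at a point already visited adds precision but does not help to identify additional parameters.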


21.2.2 Statistical assumptions 

When the number of measurements, $N$, is larger than the
number of unknown parameters, $k$, and the measurement
errors are normally distributed then the excess of information
contained in $(N - k)$ measurements is used to
make statistical predictions about the unknown parameters.
This excess of information is referred to as the
degrees of freedom of the regression.
21.3 Underlying assumptions

Classical assumptions for regression analysis include:

• The sample is representative of the population for
the inference prediction.

• The error is a random variable with a mean of zero
conditional on the explanatory variables.

• The independent variables are measured with no error.
(Note: If this is not so, modeling may be done
instead using errors-in-variables model techniques).

• The independent variables (predictors) are linearly
independent, i.e. it is not possible to express any
predictor as a linear combination of the others.

• The errors are uncorrelated, that is, the variance-covariance
matrix of the errors is diagonal and each
non-zero element is the variance of the error.

• The variance of the error is constant across observations
(homoscedasticity). If not, weighted least
squares or other methods might instead be used.

These are sufficient conditions for the least-squares estimator
to possess desirable properties; in particular, these
assumptions imply that the parameter estimates will be
unbiased, consistent, and efficient in the class of linear
unbiased estimators. It is important to note that actual
data rarely satisfies the assumptions. That is, the method
is used even though the assumptions are not true. Variation
from the assumptions can sometimes be used as a
measure of how far the model is from being useful. Many
of these assumptions may be relaxed in more advanced
treatments. Reports of statistical analyses usually include
analyses of tests on the sample data and methodology for
the fit and usefulness of the model.

Assumptions include the geometrical support of the
variables.[17] Independent and dependent variables often
refer to values measured at point locations. There may be
spatial trends and spatial autocorrelation in the variables
that violate statistical assumptions of regression. Geographic
weighted regression is one technique to deal with
such data.[18] Also, variables may include values aggregated
by areas. With aggregated data the modifiable areal
unit problem can cause extreme variation in regression
parameters.[19] When analyzing data aggregated by political
boundaries, postal codes or census areas results may
be very distinct with a different choice of units.


21.4 Linear regression

Main article: Linear regression

See simple linear regression for a derivation of these
formulas and a numerical example

In linear regression, the model specification is that the
dependent variable, $y_i$, is a linear combination of the parameters
(but need not be linear in the independent variables).
For example, in simple linear regression for modeling $n$
data points there is one independent variable: $x_i$, and
two parameters, $\beta_0$ and $\beta_1$:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, n.$$

In multiple linear regression, there are several independent
variables or functions of independent variables.

Adding a term in $x_i^2$ to the preceding regression gives:

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i, \quad i = 1, \ldots, n.$$

This is still linear regression; although the expression on
the right hand side is quadratic in the independent variable
$x_i$, it is linear in the parameters $\beta_0$, $\beta_1$ and $\beta_2$.

In both cases, $\varepsilon_i$ is an error term and the subscript $i$
indexes a particular observation.
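
That a model quadratic in the independent variable is still linear in the parameters can be seen by fitting it with an ordinary linear least-squares solver. In this sketch the data are noise-free and the coefficient values in `beta_true` are arbitrary assumptions, so the fit should reproduce them up to rounding.

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 9)
beta_true = np.array([1.0, -0.5, 2.0])  # hypothetical beta_0, beta_1, beta_2
y = beta_true[0] + beta_true[1] * x + beta_true[2] * x**2  # no error term

# Each column is a fixed function of x; the unknowns enter only linearly.
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
```

The squared term is simply treated as one more column of the design matrix, so no iterative procedure is needed.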

Returning our attention to the straight line case: Given
a random sample from the population, we estimate the
population parameters and obtain the sample linear regression
model:

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i.$$

The residual, $e_i = y_i - \hat{y}_i$, is the difference between the
value of the dependent variable predicted by the model,
$\hat{y}_i$, and the true value of the dependent variable, $y_i$.
One method of estimation is ordinary least squares. This
method obtains parameter estimates that minimize the
sum of squared residuals, SSE,[20][21] also sometimes denoted
RSS:

$$\mathrm{SSE} = \sum_{i=1}^{n} e_i^2.$$

Minimization of this function results in a set of normal
equations, a set of simultaneous linear equations in the
parameters, which are solved to yield the parameter estimators,
$\hat{\beta}_0$, $\hat{\beta}_1$.

In the case of simple regression, the formulas for the least
squares estimates are

$$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

where $\bar{x}$ is the mean (average) of the $x$ values and $\bar{y}$ is the
mean of the $y$ values.
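
The formulas for the simple regression estimates translate directly into code. The following sketch uses a small made-up data set; the function name `simple_ols` is hypothetical.

```python
def simple_ols(xs, ys):
    """Least-squares slope, intercept and SSE for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    # Residuals: observed value minus the value predicted by the fitted line.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sse = sum(e * e for e in residuals)
    return slope, intercept, sse


# Hypothetical observations (assumed values for illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

slope, intercept, sse = simple_ols(xs, ys)
print(round(slope, 4), round(intercept, 4), round(sse, 4))
```

For these five points the means are 3 and 6.1, the sum of cross products is 19.6 and the sum of squared deviations of $x$ is 10, giving a slope of 1.96 and an intercept of 0.22.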

Under the assumption that the population error term has
a constant variance, the estimate of that variance is given
by:

$$\hat{\sigma}_\varepsilon^2 = \frac{\mathrm{SSE}}{n - 2}.$$

This is called the mean square error (MSE) of the regression.
The denominator is the sample size reduced by the
number of model parameters estimated from the same
data, $(n - p)$ for $p$ regressors or $(n - p - 1)$ if an intercept is
used.[22] In this case, $p = 1$ so the denominator is $n - 2$.

The standard errors of the parameter estimates are given
by

$$\hat{\sigma}_{\beta_0} = \hat{\sigma}_\varepsilon \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}$$

$$\hat{\sigma}_{\beta_1} = \hat{\sigma}_\varepsilon \sqrt{\frac{1}{\sum (x_i - \bar{x})^2}}.$$

Under the further assumption that the population error
term is normally distributed, the researcher can use these
estimated standard errors to create confidence intervals
and conduct hypothesis tests about the population parameters.

Figure: Illustration of linear regression on a data set.


21.4.1 General linear model

For a derivation, see linear least squares
For a numerical example, see linear regression

In the more general multiple regression model, there are
$p$ independent variables:

$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i,$$

where $x_{ij}$ is the $i$th observation on the $j$th independent
variable, and where the first independent variable takes the
value 1 for all $i$ (so $\beta_1$ is the regression intercept).

The least squares parameter estimates are obtained from
$p$ normal equations. The residual can be written as

$$\hat{\varepsilon}_i = y_i - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_p x_{ip}.$$

The normal equations are

$$\sum_{i=1}^{n} \sum_{k=1}^{p} x_{ij} x_{ik} \hat{\beta}_k = \sum_{i=1}^{n} x_{ij} y_i, \quad j = 1, \ldots, p.$$

In matrix notation, the normal equations are written as

$$(X^T X) \hat{\beta} = X^T Y,$$

where the $ij$ element of $X$ is $x_{ij}$, the $i$ element of the
column vector $Y$ is $y_i$, and the $j$ element of $\hat{\beta}$ is $\hat{\beta}_j$. Thus $X$
is $n \times p$, $Y$ is $n \times 1$, and $\hat{\beta}$ is $p \times 1$. The solution is

$$\hat{\beta} = (X^T X)^{-1} X^T Y.$$

21.4.2 Diagnostics

See also: Category:Regression diagnostics.

Once a regression model has been constructed, it may be
important to confirm the goodness of fit of the model and
the statistical significance of the estimated parameters.
Commonly used checks of goodness of fit include the
R-squared, analyses of the pattern of residuals and hypothesis
testing. Statistical significance can be checked by an
F-test of the overall fit, followed by t-tests of individual
parameters.

Interpretations of these diagnostic tests rest heavily on the
model assumptions. Although examination of the residuals
can be used to invalidate a model, the results of a t-test
or F-test are sometimes more difficult to interpret if
the model’s assumptions are violated. For example, if the
error term does not have a normal distribution, in small
samples the estimated parameters will not follow normal
distributions and complicate inference. With relatively
large samples, however, a central limit theorem can be
invoked such that hypothesis testing may proceed using
asymptotic approximations.


21.4.3 “Limited dependent” variables


The phrase “limited dependent” is used in econometric 
statistics for categorical and constrained variables. 

The response variable may be non-continuous (“limited”
to lie on some subset of the real line). For binary (zero or
one) variables, if analysis proceeds with least-squares linear
regression, the model is called the linear probability
model. Nonlinear models for binary dependent variables
include the probit and logit model. The multivariate probit
model is a standard method of estimating a joint relationship
between several binary dependent variables and
some independent variables. For categorical variables
with more than two values there is the multinomial logit.
For ordinal variables with more than two values, there are
the ordered logit and ordered probit models. Censored
regression models may be used when the dependent variable
is only sometimes observed, and Heckman correction
type models may be used when the sample is not
randomly selected from the population of interest. An
alternative to such procedures is linear regression based
on polychoric correlation (or polyserial correlations) between
the categorical variables. Such procedures differ in
the assumptions made about the distribution of the variables
in the population. If the variable is positive with low
values and represents the repetition of the occurrence of
an event, then count models like the Poisson regression
or the negative binomial model may be used instead.

21.5 Interpolation and extrapolation

Regression models predict a value of the Y variable given 
known values of the X variables. Prediction within the 
range of values in the dataset used for model-fitting is 
known informally as interpolation. Prediction outside this 
range of the data is known as extrapolation. Performing 
extrapolation relies strongly on the regression assump¬ 
tions. The further the extrapolation goes outside the data, 
the more room there is for the model to fail due to dif¬ 
ferences between the assumptions and the sample data or 
the true values. 

It is generally advised that when performing extrapolation,
one should accompany the estimated value of the dependent
variable with a prediction interval that represents
the uncertainty. Such intervals tend to expand rapidly as
the values of the independent variable(s) move outside
the range covered by the observed data.
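
The widening of prediction intervals can be sketched for simple linear regression. The formula used below is not stated in the text; it is the standard error of a new observation at $x_0$ under the classical assumptions, $s \sqrt{1 + 1/n + (x_0 - \bar{x})^2 / \sum (x_i - \bar{x})^2}$ with $s^2 = \mathrm{SSE}/(n - 2)$. The data and the function name are hypothetical, and a full prediction interval would multiply this value by a quantile of the t distribution, which is omitted here.

```python
import math


def prediction_standard_error(xs, ys, x0):
    """Standard error of a new observation at x0 in simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))  # estimated error standard deviation
    return s * math.sqrt(1.0 + 1.0 / n + (x0 - x_bar) ** 2 / s_xx)


# Hypothetical observations covering the range 1 to 5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

for x0 in (3.0, 5.0, 10.0, 20.0):
    print(x0, round(prediction_standard_error(xs, ys, x0), 3))
```

The standard error is smallest at the mean of the observed $x$ values and grows as $x_0$ moves away from it, inside or outside the observed range.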

For such reasons and others, some tend to say that it might
be unwise to undertake extrapolation.[23]

However, a prediction interval does not cover the full set of
modelling errors that may be made: in particular, the
assumption of a particular form for the relation between Y and X.
A properly conducted regression analysis will include an
assessment of how well the assumed form is matched by
the observed data, but it can only do so within the range
of values of the independent variables actually available.
This means that any extrapolation is particularly reliant
on the assumptions being made about the structural form
of the regression relationship. Best-practice advice here
is that a linear-in-variables and linear-in-parameters
relationship should not be chosen simply for computational
convenience, but that all available knowledge should be
deployed in constructing a regression model. If this
knowledge includes the fact that the dependent variable
cannot go outside a certain range of values, this can be
made use of in selecting the model - even if the observed
dataset has no values particularly near such bounds. The
implications of this step of choosing an appropriate
functional form for the regression can be great when
extrapolation is considered. At a minimum, it can ensure that
any extrapolation arising from a fitted model is “realistic”
(or in accord with what is known).


21.6 Nonlinear regression 

Main article: Nonlinear regression 

When the model function is not linear in the parameters,
the sum of squares must be minimized by an iterative
procedure. This introduces many complications, which are
summarized in Differences between linear and non-linear
least squares.

21.7 Power and sample size calculations

There are no generally agreed methods for relating the
number of observations to the number of independent
variables in the model. One rule of thumb suggested
by Good and Hardin is $N = m^n$, where $N$ is the sample
size, $n$ is the number of independent variables and $m$
is the number of observations needed to reach the desired
precision if the model had only one independent
variable.[24] For example, a researcher is building a linear
regression model using a dataset that contains 1000
patients ($N$). If the researcher decides that five
observations are needed to precisely define a straight line ($m$),
then the maximum number of independent variables the
model can support is 4, because

$$\frac{\log 1000}{\log 5} \approx 4.29$$
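
The arithmetic of the rule $N = m^n$ can be checked with a
short sketch. The function name is hypothetical, and integer
powers are used instead of logarithms so that exact powers of
$m$ are not affected by floating-point rounding.

```python
def max_independent_variables(sample_size, obs_per_variable):
    """Largest n with obs_per_variable ** n <= sample_size,
    following the rule of thumb N = m^n."""
    if obs_per_variable < 2:
        raise ValueError("obs_per_variable must be at least 2")
    n = 0
    while obs_per_variable ** (n + 1) <= sample_size:
        n += 1
    return n

# The example from the text: N = 1000 patients, m = 5.
# 5**4 = 625 <= 1000 < 3125 = 5**5, so four variables are supported.
print(max_independent_variables(1000, 5))
```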

21.8 Other methods 

Although the parameters of a regression model are usu¬ 
ally estimated using the method of least squares, other 
methods which have been used include: 

• Bayesian methods, e.g. Bayesian linear regression 

• Percentage regression, for situations where reducing
percentage errors is deemed more appropriate.[25]

• Least absolute deviations, which is more robust in
the presence of outliers, leading to quantile regression

• Nonparametric regression, which requires a large number
of observations and is computationally intensive

• Distance metric learning, which is learned by the
search of a meaningful distance metric in a given
input space.[26]

21.9 Software 

Main article: List of statistical packages 

All major statistical software packages perform least 
squares regression analysis and inference. Simple linear 
regression and multiple regression using least squares can 
be done in some spreadsheet applications and on some 
calculators. While many statistical software packages can 
perform various types of nonparametric and robust re¬ 
gression, these methods are less standardized; different 
software packages implement different methods, and a 
method with a given name may be implemented differ¬ 
ently in different packages. Specialized regression soft¬ 
ware has been developed for use in fields such as survey 
analysis and neuroimaging. 

21.10 See also 

• Confidence Interval for Maximin Effects in Inhomo¬ 
geneous Data 

• Curve fitting 

• Estimation Theory 

• Forecasting 

• Fraction of variance unexplained 

• Function approximation 

• Kriging (a linear least squares estimation algorithm) 

• Local regression 

• Modifiable areal unit problem 

• Multivariate adaptive regression splines 

• Multivariate normal distribution 

• Pearson product-moment correlation coefficient 

• Prediction interval 

• Robust regression 

• Segmented regression 

• Signal processing 

• Stepwise regression

Chapter 22

Statistical learning theory 


See also: Computational learning theory 
This article is about statistical learning in machine learn¬ 
ing. For its use in psychology, see Statistical learning in 
language acquisition. 

Statistical learning theory is a framework for machine 
learning drawing from the fields of statistics and 
functional analysis. Statistical learning theory deals 
with the problem of finding a predictive function based 
on data. Statistical learning theory has led to successful
applications in fields such as computer vision, speech
recognition, bioinformatics and baseball.[2] It is the
theoretical framework underlying support vector machines.

22.1 Introduction 

The goal of learning is prediction. Learning falls
into many categories, including supervised learning,
unsupervised learning, online learning, and
reinforcement learning. From the perspective of
statistical learning theory, supervised learning is best
understood.[3] Supervised learning involves learning
from a training set of data. Every point in the training
set is an input-output pair, where the input maps to an
output. The learning problem consists of inferring the
function that maps between the input and the output in a
predictive fashion, such that the learned function can be
used to predict output from future input.

Depending on the type of output, supervised learning
problems are either problems of regression or problems
of classification. If the output takes a continuous range of
values, it is a regression problem. Using Ohm’s Law as
an example, a regression could be performed with voltage
as input and current as output. The regression would find
the functional relationship between voltage and current to
be $1/R$, such that

$$I = \frac{1}{R} V$$
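
As an illustration of the Ohm's Law example, the following
minimal sketch fits the model I = k * V by least squares, so
that the fitted slope k estimates 1/R. The measurements are
invented for the example, and the function name is
hypothetical.

```python
def fit_through_origin(voltages, currents):
    """Least-squares slope k of the model I = k * V (no intercept)."""
    num = sum(v * i for v, i in zip(voltages, currents))
    den = sum(v * v for v in voltages)
    return num / den

# Hypothetical noisy measurements taken across a 5-ohm resistor.
voltages = [1.0, 2.0, 3.0, 4.0]
currents = [0.21, 0.39, 0.61, 0.80]

k = fit_through_origin(voltages, currents)
print("estimated 1/R:", k)
print("estimated R:", 1 / k)
```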



Classification problems are those for which the output
will be an element from a discrete set of labels.
Classification is very common for machine learning applications.
In facial recognition, for instance, a picture of a person’s
face would be the input, and the output label would be
that person’s name. The input would be represented by a
large multidimensional vector, in which each dimension
represents the value of one of the pixels.

After learning a function based on the training set data, 
that function is validated on a test set of data, data that 
did not appear in the training set. 


22.2 Formal Description 

Take $X$ to be the vector space of all possible inputs, and
$Y$ to be the vector space of all possible outputs. Statistical
learning theory takes the perspective that there is some
unknown probability distribution over the product space
$Z = X \times Y$, i.e. there exists some unknown
$p(z) = p(\vec{x}, y)$. The training set is made up of $n$
samples from this probability distribution, and is notated

$$S = \{(\vec{x}_1, y_1), \dots, (\vec{x}_n, y_n)\} = \{\vec{z}_1, \dots, \vec{z}_n\}$$

Every $\vec{x}_i$ is an input vector from the training data,
and $y_i$ is the output that corresponds to it.

In this formalism, the inference problem consists of finding
a function $f : X \to Y$ such that $f(\vec{x}) \sim y$. Let
$\mathcal{H}$ be a space of functions $f : X \to Y$ called the
hypothesis space. The hypothesis space is the space of
functions the algorithm will search through. Let
$V(f(\vec{x}), y)$ be the loss functional, a metric for the
difference between the predicted value $f(\vec{x})$ and the
actual value $y$. The expected risk is defined to be

$$I[f] = \int_{X \times Y} V(f(\vec{x}), y) \, p(\vec{x}, y) \, d\vec{x} \, dy$$

The target function, the best possible function $f$ that can
be chosen, is given by the $f$ that satisfies

$$f = \arg\min_{h \in \mathcal{H}} I[h]$$

Because the probability distribution $p(\vec{x}, y)$ is unknown,
a proxy measure for the expected risk must be used.
This measure is based on the training set, a sample from
this unknown probability distribution. It is called the
empirical risk

$$I_S[f] = \frac{1}{n} \sum_{i=1}^{n} V(f(\vec{x}_i), y_i)$$

A learning algorithm that chooses the function $f_S$ that
minimizes the empirical risk is called empirical risk
minimization.
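
Empirical risk minimization can be sketched for a small, finite
hypothesis space. Everything concrete below is an assumption
made for illustration: the hypothesis space consists of the
functions f_w(x) = w * x for a grid of slopes w, the loss is the
square loss, and the training set is invented.

```python
def empirical_risk(f, samples, loss):
    """Average loss of f over the training set [(x_1, y_1), ..., (x_n, y_n)]."""
    return sum(loss(f(x), y) for x, y in samples) / len(samples)

def erm(hypotheses, samples, loss):
    """Return the hypothesis with the smallest empirical risk."""
    return min(hypotheses, key=lambda f: empirical_risk(f, samples, loss))

def square_loss(prediction, y):
    return (y - prediction) ** 2

def make_linear(w):
    """Hypothesis f_w(x) = w * x; the slope is stored for inspection."""
    f = lambda x: w * x
    f.w = w
    return f

# Hypothetical hypothesis space: slopes 0.0, 0.5, ..., 4.0.
hypotheses = [make_linear(0.5 * k) for k in range(9)]
samples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

f_s = erm(hypotheses, samples, square_loss)
print("chosen slope:", f_s.w)
print("empirical risk:", empirical_risk(f_s, samples, square_loss))
```

Because the search is restricted to the listed hypotheses, the
chosen function is the best one in the grid, not necessarily
the best linear function overall.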


22.3 Loss Functions 

The choice of loss function is a determining factor on the
function $f_S$ that will be chosen by the learning algorithm.
The loss function also affects the convergence rate for
an algorithm. It is important for the loss function to be
convex.[4]

Different loss functions are used depending on whether 
the problem is one of regression or one of classification. 

22.3.1 Regression 

The most common loss function for regression is the
square loss function. This familiar loss function is used
in ordinary least squares regression. The form is:

$$V(f(\vec{x}), y) = (y - f(\vec{x}))^2$$

The absolute value loss is also sometimes used:

$$V(f(\vec{x}), y) = |y - f(\vec{x})|$$

22.3.2 Classification 

Main article: Statistical classification 

In some sense the 0-1 indicator function is the most natural
loss function for classification. It takes the value 0
if the predicted output is the same as the actual output,
and it takes the value 1 if the predicted output is different
from the actual output. For binary classification with
labels $y \in \{-1, +1\}$, this is:

$$V(f(\vec{x}), y) = \theta(-y f(\vec{x}))$$

where $\theta$ is the Heaviside step function.

The 0-1 loss function, however, is not convex. The hinge
loss is thus often used:

$$V(f(\vec{x}), y) = (1 - y f(\vec{x}))_+$$

where $(a)_+ = \max(0, a)$.
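
The four loss functions of this section can be written down
directly. The sketch below takes the predicted value f(x) as
`fx`, assumes labels in {-1, +1} for the two classification
losses, and uses the hinge loss in its standard form
max(0, 1 - y f(x)). Counting a score of exactly zero as an
error is a convention chosen here, since the value of the
Heaviside function at zero is not fixed by the text.

```python
def square_loss(fx, y):
    """(y - f(x))^2, used in ordinary least squares."""
    return (y - fx) ** 2

def absolute_loss(fx, y):
    """|y - f(x)|."""
    return abs(y - fx)

def zero_one_loss(fx, y):
    """1 if the sign of f(x) disagrees with the label y in {-1, +1}, else 0.
    A score of exactly 0 is counted as an error (a convention)."""
    return 1 if y * fx <= 0 else 0

def hinge_loss(fx, y):
    """Convex surrogate for the 0-1 loss: max(0, 1 - y * f(x))."""
    return max(0.0, 1.0 - y * fx)

for fx in (-2.0, -0.5, 0.3, 2.0):
    print(fx, zero_one_loss(fx, 1), hinge_loss(fx, 1))
```

For a positive label, the hinge loss is zero only when the
score is at least 1, so correct predictions with a small margin
are still penalized.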

22.4 Regularization 



This image represents an example of overfitting in machine
learning. The red dots represent training set data. The green
line represents the true functional relationship, while the
blue line shows the learned function, which has fallen victim
to overfitting.

In machine learning problems, a major problem that 
arises is that of overfitting. Because learning is a predic¬ 
tion problem, the goal is not to find a function that most 
closely fits the (previously observed) data, but to find one 
that will most accurately predict output from future input. 
Empirical risk minimization runs this risk of overfitting: 
finding a function that matches the data exactly but does 
not predict future output well. 

Overfitting is symptomatic of unstable solutions; a small
perturbation in the training set data would cause a large
variation in the learned function. It can be shown that
if the stability for the solution can be guaranteed,
generalization and consistency are guaranteed as well.[5][6]
Regularization can solve the overfitting problem and give
the problem stability.

Regularization can be accomplished by restricting the
hypothesis space $\mathcal{H}$. A common example would be
restricting $\mathcal{H}$ to linear functions: this can be
seen as a reduction to the standard problem of linear
regression. $\mathcal{H}$ could also be restricted to
polynomials of degree $p$, exponentials, or bounded functions
on $L^1$. Restriction of the hypothesis space avoids
overfitting because the form of the potential functions is
limited, and so does not allow for the choice of a function
that gives empirical risk arbitrarily close to zero.

One example of regularization is Tikhonov regularization.
This consists of minimizing

$$\frac{1}{n} \sum_{i=1}^{n} V(f(\vec{x}_i), y_i) + \gamma \|f\|_{\mathcal{H}}^2$$

where $\gamma$ is a fixed and positive parameter, the
regularization parameter. Tikhonov regularization ensures
existence, uniqueness, and stability of the solution.[7]
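
For one concrete case the Tikhonov objective has a closed-form
minimizer. The sketch below assumes the square loss, the
hypothesis space of functions f(x) = w * x, and the norm
||f||^2 = w^2; these choices are made only for illustration.
Setting the derivative of (1/n) * sum((y_i - w x_i)^2) + gamma * w^2
to zero gives w = sum(x_i y_i) / (sum(x_i^2) + n * gamma).

```python
def tikhonov_slope(samples, gamma):
    """Minimizer w of (1/n) * sum((y - w*x)^2) + gamma * w^2
    over functions f(x) = w * x."""
    n = len(samples)
    sxy = sum(x * y for x, y in samples)
    sxx = sum(x * x for x, _ in samples)
    return sxy / (sxx + n * gamma)

samples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # hypothetical data
for gamma in (0.0, 0.1, 1.0, 10.0):
    print(gamma, tikhonov_slope(samples, gamma))
```

With gamma = 0 the result is the ordinary least-squares slope;
larger values of gamma shrink the slope toward zero, trading
fit on the training set for stability.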


22.5 See also 

• Reproducing kernel Hilbert spaces are a useful
choice for $\mathcal{H}$.

• Proximal gradient methods for learning

Chapter 23

Vapnik-Chervonenkis theory 


Vapnik-Chervonenkis theory (also known as VC theory)
was developed during 1960-1990 by Vladimir Vapnik
and Alexey Chervonenkis. The theory is a form of
computational learning theory, which attempts to explain
the learning process from a statistical point of view.

VC theory is related to statistical learning theory and to 
empirical processes. Richard M. Dudley and Vladimir 
Vapnik himself, among others, apply VC-theory to 
empirical processes. 

23.1 Introduction 

VC theory covers at least four parts (as explained in The 
Nature of Statistical Learning Theory ): 

• Theory of consistency of learning processes 

• What are (necessary and sufficient) conditions 
for consistency of a learning process based on 
the empirical risk minimization principle? 

• Nonasymptotic theory of the rate of convergence of 
learning processes 

• How fast is the rate of convergence of the 
learning process? 

• Theory of controlling the generalization ability of 
learning processes 

• How can one control the rate of convergence 
(the generalization ability) of the learning pro¬ 
cess? 

• Theory of constructing learning machines 

• How can one construct algorithms that can 
control the generalization ability? 

VC Theory is a major subbranch of statistical learning 
theory. One of its main applications in statistical learning 
theory is to provide generalization conditions for learning 
algorithms. From this point of view, VC theory is related 
to stability, which is an alternative approach for charac¬ 
terizing generalization. 


In addition, VC theory and VC dimension are instrumen¬ 
tal in the theory of empirical processes, in the case of 
processes indexed by VC classes. Arguably these are the 
most important applications of the VC theory, and are 
employed in proving generalization. Several techniques 
will be introduced that are widely used in the empirical 
process and VC theory. The discussion is mainly based on 
the book “Weak Convergence and Empirical Processes: 
With Applications to Statistics”. 

23.2 Overview of VC theory in Empirical Processes

23.2.1 Background on Empirical Processes

Let $X_1, \dots, X_n$ be random elements defined on a
measurable space $(\mathcal{X}, \mathcal{A})$. For a measure
$Q$ set:

$$Qf = \int f \, dQ$$

Measurability issues will be ignored here. Let $\mathcal{F}$
be a class of measurable functions
$f : \mathcal{X} \to \mathbb{R}$ and define:

$$\|Q\|_{\mathcal{F}} = \sup\{ |Qf| : f \in \mathcal{F} \}.$$

Define the empirical measure

$$\mathbb{P}_n = n^{-1} \sum_{i=1}^{n} \delta_{X_i},$$

where $\delta$ here stands for the Dirac measure. The empirical
measure induces a map $\mathcal{F} \to \mathbb{R}$ given by:

$$f \mapsto \mathbb{P}_n f$$
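
For a finite sample, the empirical measure of a function is
just its sample average, and the supremum over a finite class
is a maximum. The sketch below makes two assumptions for
illustration only: the true distribution is uniform on [0, 1],
and the class consists of three indicator functions
1{x <= t}, for which the true mean is t. It computes the
supremum norm defined above, applied to the difference between
the empirical measure and the true distribution.

```python
def empirical_measure(sample, f):
    """P_n f: the average of f over the sample X_1, ..., X_n."""
    return sum(f(x) for x in sample) / len(sample)

def sup_deviation(sample, functions_with_means):
    """Maximum over a finite class of |P_n f - P f|,
    where the class is given as (f, P f) pairs."""
    return max(abs(empirical_measure(sample, f) - pf)
               for f, pf in functions_with_means)

def indicator_below(t):
    """The function x -> 1{x <= t}."""
    return lambda x: 1.0 if x <= t else 0.0

# Assumption: P is uniform on [0, 1], so P f = t for f = 1{x <= t}.
thresholds = [0.25, 0.5, 0.75]
function_class = [(indicator_below(t), t) for t in thresholds]
sample = [0.1, 0.4, 0.4, 0.9]  # hypothetical observations

print(sup_deviation(sample, function_class))
```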

Now suppose $P$ is the underlying true distribution of the
data, which is unknown. Empirical Processes theory aims
at identifying classes $\mathcal{F}$ for which statements such
as the following hold:

• uniform law of large numbers:

$$\|\mathbb{P}_n - P\|_{\mathcal{F}} \to 0,$$

• uniform central limit theorem:

$$\mathbb{G}_n = \sqrt{n} (\mathbb{P}_n - P) \rightsquigarrow \mathbb{G}, \quad \text{in } \ell^\infty(\mathcal{F})$$

In the former case $\mathcal{F}$ is called Glivenko-Cantelli
class, and in the latter case (under the assumption
$\forall x, \sup_{f \in \mathcal{F}} |f(x) - Pf| < \infty$) the
class $\mathcal{F}$ is called Donsker or $P$-Donsker. A
Donsker class is Glivenko-Cantelli in probability by an
application of Slutsky’s theorem.

These statements are true for a single $f$, by standard
LLN, CLT arguments under regularity conditions, and
the difficulty in the Empirical Processes comes in because
joint statements are being made for all $f \in \mathcal{F}$.
Intuitively then, the set $\mathcal{F}$ cannot be too large,
and as it turns out the geometry of $\mathcal{F}$ plays a very
important role.

One way of measuring how big the function set $\mathcal{F}$
is, is to use the so-called covering numbers. The covering
number

$$N(\varepsilon, \mathcal{F}, \|\cdot\|)$$

is the minimal number of balls
$\{g : \|g - f\| < \varepsilon\}$ needed to cover the set
$\mathcal{F}$ (here it is assumed that there is an underlying
norm on $\mathcal{F}$). The entropy is the logarithm of the
covering number.

Two sufficient conditions are provided below, under
which it can be proved that the set $\mathcal{F}$ is
Glivenko-Cantelli or Donsker.

A class $\mathcal{F}$ is $P$-Glivenko-Cantelli if it is
$P$-measurable with envelope $F$ such that $P^* F < \infty$
and satisfies:

$$\forall \varepsilon > 0 \quad \sup_Q N(\varepsilon \|F\|_Q, \mathcal{F}, L_1(Q)) < \infty.$$

The next condition is a version of the celebrated Dudley’s
theorem. If $\mathcal{F}$ is a class of functions such that

$$\int_0^\infty \sup_Q \sqrt{\log N(\varepsilon \|F\|_{Q,2}, \mathcal{F}, L_2(Q))} \, d\varepsilon < \infty$$

then $\mathcal{F}$ is $P$-Donsker for every probability measure
$P$ such that $P^* F^2 < \infty$. In the last integral, the
notation means

$$\|f\|_{Q,2} = \left( \int |f|^2 \, dQ \right)^{1/2}$$

23.2.2 Symmetrization

The majority of the arguments of how to bound the
empirical process rely on symmetrization, maximal and
concentration inequalities and chaining. Symmetrization
is usually the first step of the proofs, and since it is used
in many machine learning proofs on bounding empirical
loss functions (including the proof of the VC inequality
which is discussed in the next section) it is presented here.

Consider the empirical process:

$$f \mapsto (\mathbb{P}_n - P) f = \frac{1}{n} \sum_{i=1}^{n} (f(X_i) - Pf)$$

It turns out that there is a connection between the empirical
and the following symmetrized process:

$$f \mapsto \mathbb{P}_n^0 f = \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(X_i)$$

where the $\varepsilon_i$ are independent random signs. The
symmetrized process is a Rademacher process, conditionally
on the data $X_i$. Therefore it is a sub-Gaussian
process by Hoeffding’s inequality.

Lemma (Symmetrization). For every nondecreasing,
convex $\Phi : \mathbb{R} \to \mathbb{R}$ and class of
measurable functions $\mathcal{F}$,

$$\mathbb{E} \Phi(\|\mathbb{P}_n - P\|_{\mathcal{F}}) \le \mathbb{E} \Phi(2 \|\mathbb{P}_n^0\|_{\mathcal{F}})$$

The proof of the Symmetrization lemma relies on introducing
independent copies of the original variables $X_i$
(sometimes referred to as a ghost sample) and replacing
the inner expectation of the LHS by these copies. After
an application of Jensen’s inequality different signs could
be introduced (hence the name symmetrization) without
changing the expectation. The proof can be found below
because of its instructive nature.

[Proof]

Introduce the “ghost sample” $Y_1, \dots, Y_n$ to be
independent copies of $X_1, \dots, X_n$. For fixed values of
$X_1, \dots, X_n$ one has:

$$\|\mathbb{P}_n - P\|_{\mathcal{F}} = \sup_{f \in \mathcal{F}} \frac{1}{n} \left| \sum_{i=1}^{n} (f(X_i) - \mathbb{E} f(Y_i)) \right| \le \mathbb{E}_Y \sup_{f \in \mathcal{F}} \frac{1}{n} \left| \sum_{i=1}^{n} (f(X_i) - f(Y_i)) \right|$$

Therefore by Jensen’s inequality:

$$\Phi(\|\mathbb{P}_n - P\|_{\mathcal{F}}) \le \mathbb{E}_Y \Phi\left( \left\| \frac{1}{n} \sum_{i=1}^{n} (f(X_i) - f(Y_i)) \right\|_{\mathcal{F}} \right)$$

Taking expectation with respect to $X$ gives:

$$\mathbb{E} \Phi(\|\mathbb{P}_n - P\|_{\mathcal{F}}) \le \mathbb{E}_X \mathbb{E}_Y \Phi\left( \left\| \frac{1}{n} \sum_{i=1}^{n} (f(X_i) - f(Y_i)) \right\|_{\mathcal{F}} \right)$$

Note that adding a minus sign in front of a term
$f(X_i) - f(Y_i)$ doesn't change the RHS, because it’s a
symmetric function of $X$ and $Y$. Therefore the RHS remains
the same under “sign perturbation":

$$\mathbb{E} \Phi\left( \left\| \frac{1}{n} \sum_{i=1}^{n} e_i (f(X_i) - f(Y_i)) \right\|_{\mathcal{F}} \right)$$

for any $(e_1, e_2, \dots, e_n) \in \{-1, 1\}^n$. Therefore:

$$\mathbb{E} \Phi(\|\mathbb{P}_n - P\|_{\mathcal{F}}) \le \mathbb{E}_\varepsilon \mathbb{E} \Phi\left( \left\| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i (f(X_i) - f(Y_i)) \right\|_{\mathcal{F}} \right)$$

Finally using first triangle inequality and then convexity
of $\Phi$ gives:

$$\mathbb{E} \Phi(\|\mathbb{P}_n - P\|_{\mathcal{F}}) \le \frac{1}{2} \mathbb{E}_\varepsilon \mathbb{E} \Phi\left( 2 \left\| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(X_i) \right\|_{\mathcal{F}} \right) + \frac{1}{2} \mathbb{E}_\varepsilon \mathbb{E} \Phi\left( 2 \left\| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(Y_i) \right\|_{\mathcal{F}} \right)$$

Where the last two expressions on the RHS are the same,
which concludes the proof.

A typical way of proving empirical CLTs first uses
symmetrization to pass the empirical process to
$\mathbb{P}_n^0$ and then argues conditionally on the data,
using the fact that Rademacher processes are simple processes
with nice properties.

23.2.3 VC Connection

It turns out that there is a fascinating connection between
certain combinatorial properties of the set $\mathcal{F}$ and
the entropy numbers. Uniform covering numbers can be
controlled by the notion of Vapnik-Chervonenkis classes of
sets - or shortly VC sets.

Take a collection $\mathcal{C}$ of subsets of the sample space
$\mathcal{X}$. A collection of sets $\mathcal{C}$ is said to
pick out a certain subset $W$ of the finite set
$S = \{x_1, \dots, x_n\} \subset \mathcal{X}$ if
$W = S \cap C$ for some $C \in \mathcal{C}$. $\mathcal{C}$ is
said to shatter $S$ if it picks out each of its $2^n$ subsets.
The VC-index (similar to VC dimension + 1 for an appropriately
chosen classifier set) $V(\mathcal{C})$ of $\mathcal{C}$
is the smallest $n$ for which no set of size $n$ is shattered
by $\mathcal{C}$.

Sauer’s lemma then states that the number
$\Delta_n(\mathcal{C}, x_1, \dots, x_n)$ of subsets picked out
by a VC-class $\mathcal{C}$ satisfies:

$$\max_{x_1, \dots, x_n} \Delta_n(\mathcal{C}, x_1, \dots, x_n) \le \sum_{j=0}^{V(\mathcal{C}) - 1} \binom{n}{j} \le \left( \frac{n e}{V(\mathcal{C}) - 1} \right)^{V(\mathcal{C}) - 1}$$

Which is a polynomial number $O(n^{V(\mathcal{C}) - 1})$ of
subsets rather than an exponential number. Intuitively this
means that a finite VC-index implies that $\mathcal{C}$ has an
apparent simplistic structure.

A similar bound can be shown (with a different constant,
same rate) for the so-called VC subgraph classes. For a
function $f : \mathcal{X} \to \mathbb{R}$ the subgraph is a
subset of $\mathcal{X} \times \mathbb{R}$ such that:
$\{(x, t) : t < f(x)\}$. A collection $\mathcal{F}$ is
called a VC subgraph class if all subgraphs form a VC-class.

Consider a set of indicator functions
$\mathcal{I}_{\mathcal{C}} = \{1_C : C \in \mathcal{C}\}$
in $L_1(Q)$ for discrete empirical type of measure $Q$ (or
equivalently for any probability measure $Q$). It can then
be shown that quite remarkably, for $r \ge 1$:

$$N(\varepsilon, \mathcal{I}_{\mathcal{C}}, L_r(Q)) \le K V(\mathcal{C}) (4e)^{V(\mathcal{C})} \varepsilon^{-r(V(\mathcal{C}) - 1)}$$

Further consider the symmetric convex hull of a set
$\mathcal{F}$: $\operatorname{sconv} \mathcal{F}$ being the
collection of functions of the form
$\sum_{i=1}^{m} \alpha_i f_i$ with
$\sum_{i=1}^{m} |\alpha_i| \le 1$. Then if

$$N(\varepsilon \|F\|_{Q,2}, \mathcal{F}, L_2(Q)) \le C \varepsilon^{-V}$$

the following is valid for the convex hull of $\mathcal{F}$:

$$\log N(\varepsilon \|F\|_{Q,2}, \operatorname{sconv} \mathcal{F}, L_2(Q)) \le K \varepsilon^{-\frac{2V}{V+2}}$$

The important consequence of this fact is that

$$\frac{2V}{V+2} < 2,$$

which is just enough so that the entropy integral is going
to converge, and therefore the class
$\operatorname{sconv} \mathcal{F}$ is going to be $P$-Donsker.

Finally an example of a VC-subgraph class is considered.
Any finite-dimensional vector space $\mathcal{F}$ of
measurable functions $f : \mathcal{X} \to \mathbb{R}$ is
VC-subgraph of index smaller than or equal to
$\dim(\mathcal{F}) + 2$.

[Proof]

Take $n = \dim(\mathcal{F}) + 2$ points
$(x_1, t_1), \dots, (x_n, t_n)$. The vectors:

$$(f(x_1), \dots, f(x_n)) - (t_1, \dots, t_n)$$

are in an $n - 1$ dimensional subspace of $\mathbb{R}^n$. Take
$a \ne 0$, a vector that is orthogonal to this subspace.
Therefore:

$$\sum_{a_i > 0} a_i (f(x_i) - t_i) = \sum_{a_i < 0} (-a_i)(f(x_i) - t_i), \quad \forall f \in \mathcal{F}$$

Consider the set $S = \{(x_i, t_i) : a_i > 0\}$, which may be
assumed non-empty after replacing $a$ by $-a$ if necessary.
This set cannot be picked out since if there is some $f$ such
that $S = \{(x_i, t_i) : f(x_i) > t_i\}$ that would imply that
the LHS is strictly positive but the RHS is non-positive.

There are generalizations of the notion of VC subgraph
class, e.g. there is the notion of pseudo-dimension.

23.3 VC Inequality

A similar setting is considered, which is more common
to machine learning. Let $\mathcal{X}$ be a feature space and
$\mathcal{Y} = \{0, 1\}$. A function
$f : \mathcal{X} \to \mathcal{Y}$ is called a classifier.
Let $\mathcal{F}$ be a set of classifiers. Similarly to the
previous section, define the shattering coefficient (also
known as growth function):

$$S(\mathcal{F}, n) = \max_{x_1, \dots, x_n} |\{(f(x_1), \dots, f(x_n)), f \in \mathcal{F}\}|$$

Note here that there is a 1:1 mapping between each of
the functions in $\mathcal{F}$ and the set on which the
function is 1. Therefore in terms of the previous section the
shattering coefficient is precisely

$$\max_{x_1, \dots, x_n} \Delta_n(\mathcal{C}, x_1, \dots, x_n)$$

for $\mathcal{C}$ being the collection of all sets described
above. Now for the same reasoning as before, namely using
Sauer’s Lemma it can be shown that $S(\mathcal{F}, n)$ is going
to be polynomial in $n$ provided that the class $\mathcal{F}$
has a finite VC-dimension or equivalently the collection
$\mathcal{C}$ has finite VC-index.

Let $D_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$ be an observed
dataset. Assume that the data is generated by an unknown
probability distribution $P_{XY}$. Define
$R(f) = P(f(X) \ne Y)$ to be the expected 0/1 loss. Of course
since $P_{XY}$ is unknown in general, one has no access to
$R(f)$. However the empirical risk, given by:

$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(f(X_i) \ne Y_i)$$

can certainly be evaluated. Then one has the following
Theorem:

23.3.1 Theorem (VC Inequality)

For binary classification and the 0/1 loss function we have
the following generalization bounds:

$$P\left( \sup_{f \in \mathcal{F}} \left| \hat{R}_n(f) - R(f) \right| > \varepsilon \right) \le 8 S(\mathcal{F}, n) e^{-n \varepsilon^2 / 32}$$

$$\mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left| \hat{R}_n(f) - R(f) \right| \right] \le 2 \sqrt{\frac{\log S(\mathcal{F}, n) + \log 2}{n}}$$

In words the VC inequality is saying that as the sample
increases, provided that $\mathcal{F}$ has a finite VC
dimension, the empirical 0/1 risk becomes a good proxy for the
expected 0/1 risk. Note that both RHS of the two inequalities
will converge to 0, provided that $S(\mathcal{F}, n)$ grows
polynomially in $n$.

The connection between this framework and the Empirical
Process framework is evident. Here one is dealing
with a modified empirical process

$$\left\| \hat{R}_n - R \right\|_{\mathcal{F}}$$

but not surprisingly the ideas are the same. The proof of
the (first part of) VC inequality relies on symmetrization,
and then argues conditionally on the data using concentration
inequalities (in particular Hoeffding’s inequality).
The interested reader can check the book Theorems 12.4
and 12.5.
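
To get a feel for the two bounds of the VC inequality,
8 S(F, n) exp(-n eps^2 / 32) and 2 sqrt((log S(F, n) + log 2) / n),
the sketch below evaluates them with the shattering
coefficient replaced by the polynomial bound from Sauer's
lemma. The VC-index of 3, the tolerance 0.1 and the function
names are assumptions chosen only for illustration.

```python
import math

def sauer_bound(n, vc_index):
    """Sum of C(n, j) for j = 0 .. V - 1: Sauer's bound on the
    number of subsets of n points picked out by a VC-class."""
    return sum(math.comb(n, j) for j in range(vc_index))

def vc_tail_bound(n, eps, shatter_coeff):
    """8 * S(F, n) * exp(-n * eps^2 / 32)."""
    return 8 * shatter_coeff * math.exp(-n * eps ** 2 / 32)

def vc_expectation_bound(n, shatter_coeff):
    """2 * sqrt((log S(F, n) + log 2) / n)."""
    return 2 * math.sqrt((math.log(shatter_coeff) + math.log(2)) / n)

for n in (10**3, 10**4, 10**5, 10**6):
    s = sauer_bound(n, 3)  # hypothetical class with VC-index 3
    print(n, vc_tail_bound(n, 0.1, s), vc_expectation_bound(n, s))
```

For small samples the tail bound exceeds 1 and says nothing;
it becomes informative only once the exponential factor
outweighs the polynomial growth of the shattering coefficient.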
Chapter 24

Probably approximately correct learning 


In computational learning theory, probably approxi¬ 
mately correct learning (PAC learning) is a framework 
for mathematical analysis of machine learning. It was
proposed in 1984 by Leslie Valiant.[1]

In this framework, the learner receives samples and must 
select a generalization function (called the hypothesis ) 
from a certain class of possible functions. The goal is 
that, with high probability (the “probably” part), the se¬ 
lected function will have low generalization error (the 
“approximately correct” part). The learner must be able 
to learn the concept given any arbitrary approximation ra¬ 
tio, probability of success, or distribution of the samples. 

The model was later extended to treat noise (misclassified 
samples). 

An important innovation of the PAC framework is the in¬ 
troduction of computational complexity theory concepts 
to machine learning. In particular, the learner is ex¬ 
pected to find efficient functions (time and space require¬ 
ments bounded to a polynomial of the example size), and 
the learner itself must implement an efficient procedure 
(requiring an example count bounded to a polynomial 
of the concept size, modified by the approximation and 
likelihood bounds). 


24.1 Definitions and terminology 

In order to give the definition for something that is
PAC-learnable, we first have to introduce some
terminology.[2][3]

For the following definitions, two examples will be used. 
The first is the problem of character recognition given an 
array of n bits encoding a binary-valued image. The other 
example is the problem of finding an interval that will 
correctly classify points within the interval as positive and 
the points outside of the range as negative. 

Let $X$ be a set called the instance space or the encoding
of all the samples, and let each instance have a length
assigned. In the character recognition problem, the instance
space is $X = \{0, 1\}^n$. In the interval problem the
instance space is $X = \mathbb{R}$, where $\mathbb{R}$ denotes
the set of all real numbers.

A concept is a subset $c \subset X$. One concept is the set of
all patterns of bits in $X = \{0, 1\}^n$ that encode a picture
of the letter “P”. An example concept from the second
example is the set of all of the numbers between $\pi/2$ and
$\sqrt{10}$. A concept class $C$ is a set of concepts over
$X$. This could be the set of all subsets of the array of bits
that are skeletonized 4-connected (width of the font is 1).

Let $EX(c, D)$ be a procedure that draws an example, $x$,
using a probability distribution $D$ and gives the correct
label $c(x)$, that is 1 if $x \in c$ and 0 otherwise.

Suppose there is an algorithm $A$ that, given access to
$EX(c, D)$ and inputs $\epsilon$ and $\delta$, outputs with
probability at least $1 - \delta$ a hypothesis $h \in C$ that
has error less than or equal to $\epsilon$ with examples drawn
from $X$ with the distribution $D$. If there is such an
algorithm for every concept $c \in C$, for every distribution
$D$ over $X$, and for all $0 < \epsilon < 1/2$ and
$0 < \delta < 1/2$, then $C$ is PAC learnable (or
distribution-free PAC learnable). We can also say that $A$ is
a PAC learning algorithm for $C$.

An algorithm runs in time $t$ if it draws at most $t$
examples and requires at most $t$ time steps. A concept class
is efficiently PAC learnable if it is PAC learnable by an
algorithm that runs in time polynomial in $1/\epsilon$,
$1/\delta$ and instance length.
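
The interval example can be turned into a small PAC learning
sketch. The learner below returns the smallest interval that
contains all positive examples. Several details are
assumptions made for illustration: the distribution D is
uniform on [0, 5], the oracle and function names are
hypothetical, and the sample size (2/epsilon) * ln(2/delta) is
the standard sufficient bound for this particular learner, not
a statement taken from the text.

```python
import math
import random

def sample_size(eps, delta):
    """Number of examples sufficient for the tightest-interval
    learner: ceil((2 / eps) * ln(2 / delta))."""
    return math.ceil((2 / eps) * math.log(2 / delta))

def make_oracle(lo, hi, rng):
    """EX(c, D) for the concept c = [lo, hi], with D assumed
    to be the uniform distribution on [0, 5]."""
    def ex():
        x = rng.uniform(0.0, 5.0)
        return x, 1 if lo <= x <= hi else 0
    return ex

def learn_interval(ex, m):
    """Draw m examples; return the smallest interval containing
    all positive ones, or None (the empty concept) if there are none."""
    positives = [x for x, label in (ex() for _ in range(m)) if label == 1]
    if not positives:
        return None
    return min(positives), max(positives)

rng = random.Random(0)
lo, hi = math.pi / 2, math.sqrt(10)   # the target concept from the text
eps, delta = 0.1, 1e-6
m = sample_size(eps, delta)
h = learn_interval(make_oracle(lo, hi, rng), m)

# Under the uniform D on [0, 5], the error of h is the probability
# mass of the part of the target interval that h fails to cover.
error = (hi - lo) / 5 if h is None else ((h[0] - lo) + (hi - h[1])) / 5
print(m, h, error)
```

The hypothesis always lies inside the target interval, so it
can only err on points near the two endpoints that no positive
example happened to cover.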


24.2 Equivalence 

Under some regularity conditions these three conditions 
are equivalent: 

1. The concept class C is PAC learnable. 

2. The VC dimension of C is finite. 

3. C is a uniform Glivenko-Cantelli class.


Chapter 25 

Algorithmic learning theory 


Algorithmic learning theory is a mathematical frame¬ 
work for analyzing machine learning problems and algo¬ 
rithms. Synonyms include formal learning theory and 
algorithmic inductive inference. Algorithmic learning 
theory is different from statistical learning theory in that 
it does not make use of statistical assumptions and anal¬ 
ysis. Both algorithmic and statistical learning theory are 
concerned with machine learning and can thus be viewed 
as branches of computational learning theory. 

25.1 Distinguishing Characteristics

Unlike statistical learning theory and most statistical
theory in general, algorithmic learning theory does not
assume that data are random samples, that is, that data
points are independent of each other. This makes the
theory suitable for domains where observations are
(relatively) noise-free but not random, such as language
learning[1] and automated scientific discovery.[2][3]

The fundamental concept of algorithmic learning theory 
is learning in the limit: as the number of data points in¬ 
creases, a learning algorithm should converge to a cor¬ 
rect hypothesis on every possible data sequence consistent 
with the problem space. This is a non-probabilistic ver¬ 
sion of statistical consistency, which also requires conver¬ 
gence to a correct model in the limit, but allows a learner 
to fail on data sequences with probability measure 0. 

Algorithmic learning theory investigates the learning 
power of Turing machines. Other frameworks consider 
a much more restricted class of learning algorithms than 
Turing machines, for example learners that compute hy¬ 
potheses more quickly, for instance in polynomial time. 
An example of such a framework is probably approxi¬ 
mately correct learning. 

25.2 Learning in the limit 

The concept was introduced in E. Mark Gold's seminal
paper "Language identification in the limit".[4] The
objective of language identification is for a machine
running one program to be capable of developing another
program by which any given sentence can be tested to
determine whether it is “grammatical” or “ungrammatical”.
The language being learned need not be English
or any other natural language - in fact the definition of
“grammatical” can be absolutely anything known to the
tester.

In Gold’s learning model, the tester gives the learner an 
example sentence at each step, and the learner responds 
with a hypothesis, which is a suggested program to deter¬ 
mine grammatical correctness. It is required of the tester 
that every possible sentence (grammatical or not) appears 
in the list eventually, but no particular order is required. It 
is required of the learner that at each step the hypothesis 
must be correct for all the sentences so far. 

A particular learner is said to be able to “learn a language 
in the limit” if there is a certain number of steps beyond 
which its hypothesis no longer changes. At this point it 
has indeed learned the language, because every possible 
sentence appears somewhere in the sequence of inputs 
(past or future), and the hypothesis is correct for all inputs 
(past or future), so the hypothesis is correct for every
sentence. The learner is not required to be able to tell when
it has reached a correct hypothesis; all that is required is
that it be true.

Gold showed that any language which is defined by a 
Turing machine program can be learned in the limit 
by another Turing-complete machine using enumeration. 
This is done by the learner testing all possible Turing ma¬ 
chine programs in turn until one is found which is cor¬ 
rect so far - this forms the hypothesis for the current 
step. Eventually, the correct program will be reached, 
after which the hypothesis will never change again (but 
note that the learner does not know that it won't need to 
change). 
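
Identification by enumeration can be sketched on a toy problem.
In place of all Turing machine programs, the sketch below
enumerates a deliberately simple, hypothetical class of
languages: hypothesis k accepts exactly the positive integers
divisible by k. The tester presents every "sentence" (here,
each integer from 1 to 30) together with its label, and the
learner keeps the first hypothesis in the enumeration that
agrees with everything seen so far.

```python
def learn_in_the_limit(labelled_stream, hypothesis_from_index, consistent):
    """Yield the index of the learner's hypothesis after each example.
    The inner loop only terminates if some hypothesis in the
    enumeration is consistent with the data."""
    seen = []
    k = 1
    for example in labelled_stream:
        seen.append(example)
        # Advance through the enumeration until a hypothesis
        # agrees with all examples presented so far.
        while not consistent(hypothesis_from_index(k), seen):
            k += 1
        yield k

def multiples_of(k):
    """Toy hypothesis: accepts n exactly when k divides n."""
    return lambda n: n % k == 0

def consistent(h, seen):
    return all(h(n) == label for n, label in seen)

target = multiples_of(6)
stream = [(n, target(n)) for n in range(1, 31)]
guesses = list(learn_in_the_limit(stream, multiples_of, consistent))
print(guesses)
```

The sequence of guesses changes a few times and then stays
fixed at the target, although nothing in the learner signals
that the final hypothesis has been reached.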

Gold also showed that if the learner is given only positive
examples (that is, only grammatical sentences appear
in the input, not ungrammatical sentences), then the language
can only be guaranteed to be learned in the limit if
there are only a finite number of possible sentences in the 
language (this is possible if, for example, sentences are 
known to be of limited length). 

Language identification in the limit is a highly abstract
model. It does not allow for limits of runtime or computer
memory which can occur in practice, and the enumeration
method may fail if there are errors in the input. However,
the framework is very powerful, because if these
strict conditions are maintained, it allows the learning of
any program known to be computable. This is because
a Turing machine program can be written to mimic any
program in any conventional programming language. See
Church-Turing thesis.

25.3 Other Identification Criteria 

Learning theorists have investigated other learning
criteria,[5] such as the following.

• Efficiency: minimizing the number of data points required
before convergence to a correct hypothesis.

• Mind Changes: minimizing the number of hypothesis
changes that occur before convergence.[6]

Mind change bounds are closely related to mistake
bounds that are studied in statistical learning theory.[7]
Kevin Kelly has suggested that minimizing mind changes
is closely related to choosing maximally simple hypotheses
in the sense of Occam’s Razor.[8]


25.4 See also

• Sample exclusion dimension


[7] Jain, S. and Sharma, A. (1999), On a generalized notion of
mistake bounds. Proceedings of the Conference on Learning
Theory (COLT), pp. 249-256.

[8] Kevin T. Kelly (2007), Ockham's Razor, Empirical Complexity,
and Truth-finding Efficiency, Theoretical Computer
Science, 383: 270-289.


25.6 External links

• Learning Theory in Computer Science.

• The Stanford Encyclopaedia of Philosophy provides
a highly accessible introduction to key concepts in
algorithmic learning theory, especially as they apply
to the philosophical problems of inductive inference.
Chapter 26


Statistical hypothesis testing 


“Critical region” redirects here. For the computer 
science notion of a “critical section”, sometimes called a 
“critical region”, see critical section. 

A statistical hypothesis is a scientific hypothesis that 
is testable on the basis of observing a process that is 
modeled via a set of random variables.[1] A statistical
hypothesis test is a method of statistical inference used 
for testing a statistical hypothesis. 

A test result is called statistically significant if it has been
predicted as unlikely to have occurred by sampling error
alone, according to a threshold probability, the significance
level. Hypothesis tests are used in determining
what outcomes of a study would lead to a rejection of the
null hypothesis for a pre-specified level of significance.
In the Neyman-Pearson framework (see below), the process
of distinguishing between the null hypothesis and the
alternative hypothesis is aided by identifying two conceptual
types of errors (type 1 and type 2), and by specifying
parametric limits on, e.g., how much type 1 error will be
permitted.

An alternative framework for statistical hypothesis testing
is to specify a set of statistical models, one for each
candidate hypothesis, and then use model selection techniques
to choose the most appropriate model.[2] The most
common selection techniques are based on either the Akaike
information criterion or the Bayes factor.

Statistical hypothesis testing is sometimes called confirmatory
data analysis. It can be contrasted with
exploratory data analysis, which may not have pre-specified
hypotheses.

26.1 Variations and sub-classes 

Statistical hypothesis testing is a key technique of both
Frequentist inference and Bayesian inference, although
the two types of inference have notable differences. Statistical
hypothesis tests define a procedure that controls
(fixes) the probability of incorrectly deciding that a default
position (null hypothesis) is incorrect. The procedure
is based on how likely it would be for a set of observations
to occur if the null hypothesis were true. Note that
this probability of making an incorrect decision is not the
probability that the null hypothesis is true, nor the probability
that any specific alternative hypothesis is true. This contrasts
with other possible techniques of decision theory in which
the null and alternative hypothesis are treated on a more
equal basis.

One naive Bayesian approach to hypothesis testing is to
base decisions on the posterior probability,[3][4] but this
fails when comparing point and continuous hypotheses.
Other approaches to decision making, such as Bayesian
decision theory, attempt to balance the consequences of
incorrect decisions across all possibilities, rather than
concentrating on a single null hypothesis. A number of
other approaches to reaching a decision based on data are
available via decision theory and optimal decisions, some
of which have desirable properties. Hypothesis testing,
though, is a dominant approach to data analysis in many
fields of science. Extensions to the theory of hypothesis
testing include the study of the power of tests, i.e.
the probability of correctly rejecting the null hypothesis
given that it is false. Such considerations can be used for
the purpose of sample size determination prior to the collection
of data.
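As an illustration of how power feeds into sample size determination, the sketch below treats the simplest textbook case: a one-sided z-test of a mean with known standard deviation. The function names, the effect size of 0.5 standard deviations and the 80% power target are assumptions chosen for the example, not values taken from the text.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def power_one_sided_z(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mean = mu0 when the true mean
    is mu0 + effect and the standard deviation sigma is known."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(z_crit - effect * sqrt(n) / sigma)


def sample_size_one_sided_z(effect, sigma, alpha=0.05, power=0.80):
    """Smallest n for which the test above reaches the requested power."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_beta = STD_NORMAL.inv_cdf(power)
    return ceil(((z_alpha + z_beta) * sigma / effect) ** 2)


n_required = sample_size_one_sided_z(effect=0.5, sigma=1.0)
achieved_power = power_one_sided_z(effect=0.5, sigma=1.0, n=n_required)
print(n_required, round(achieved_power, 3))
```

Under these assumptions 25 observations are needed. A smaller assumed effect or a stricter significance level increases the requirement.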


26.2 The testing process 

In the statistics literature, statistical hypothesis testing 
plays a fundamental role.[5] The usual line of reasoning
is as follows: 

1. There is an initial research hypothesis of which the 
truth is unknown. 

2. The first step is to state the relevant null and alternative
hypotheses. This is important as mis-stating
the hypotheses will muddy the rest of the process.

3. The second step is to consider the statistical assumptions
being made about the sample in doing the test;
for example, assumptions about the statistical independence
or about the form of the distributions of
the observations. This is equally important as invalid
assumptions will mean that the results of the test are
invalid.

4. Decide which test is appropriate, and state the relevant
test statistic T.

5. Derive the distribution of the test statistic under the
null hypothesis from the assumptions. In standard
cases this will be a well-known result. For example,
the test statistic might follow a Student’s t distribution
or a normal distribution.

6. Select a significance level (α), a probability threshold
below which the null hypothesis will be rejected.
Common values are 5% and 1%.

7. The distribution of the test statistic under the null
hypothesis partitions the possible values of T into
those for which the null hypothesis is rejected (the
so-called critical region) and those for which it is
not. The probability of the critical region is α.

8. Compute from the observations the observed value
t_obs of the test statistic T.

9. Decide to either reject the null hypothesis in favor of
the alternative or not reject it. The decision rule is
to reject the null hypothesis H0 if the observed value
t_obs is in the critical region, and to accept or “fail to
reject” the hypothesis otherwise.
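The steps above can be sketched in code for one concrete choice of test. The example assumes a one-sided z-test of a mean with known standard deviation, so that the test statistic is standard normal under the null hypothesis. The data values and the function name `critical_region_z_test` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def critical_region_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """One-sided z-test of H0: mean = mu0 against mean > mu0.

    Returns the observed statistic, the critical value, and the decision.
    """
    # Observed value of the test statistic.
    t_obs = (sample_mean - mu0) / (sigma / sqrt(n))
    # The critical region is everything above the (1 - alpha) quantile
    # of the null distribution of the statistic.
    critical_value = NormalDist().inv_cdf(1 - alpha)
    reject_h0 = t_obs > critical_value
    return t_obs, critical_value, reject_h0


t_obs, critical_value, reject_h0 = critical_region_z_test(
    sample_mean=103.0, mu0=100.0, sigma=10.0, n=50
)
print(round(t_obs, 3), round(critical_value, 3), reject_h0)
```

No probability is computed for the observed data here. The decision comes from comparing the statistic with a tabulated quantile.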

An alternative process is commonly used: 

1. Compute from the observations the observed value
t_obs of the test statistic T.

2. Calculate the p-value. This is the probability, under
the null hypothesis, of sampling a test statistic at
least as extreme as that which was observed.

3. Reject the null hypothesis, in favor of the alternative
hypothesis, if and only if the p-value is less than the
significance level (the selected probability) threshold.
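A minimal sketch of the p-value procedure follows, using an exact binomial test as the example. The scenario (61 heads in 100 tosses, tested against a fair coin) and the function name are assumptions made for illustration only.

```python
from math import comb


def binomial_upper_p_value(successes, trials, p0):
    """P(X >= successes) when X ~ Binomial(trials, p0), i.e. the
    probability under H0 of a count at least as extreme as observed."""
    return sum(
        comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
        for k in range(successes, trials + 1)
    )


alpha = 0.05  # significance level chosen before looking at the data
p_value = binomial_upper_p_value(successes=61, trials=100, p0=0.5)
reject_h0 = p_value < alpha
print(round(p_value, 4), reject_h0)
```

Reporting the p-value (a little under 2% in this case) conveys more than the bare decision, since a reader can apply a different significance level without redoing the analysis.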

The two processes are equivalent.[6] The former process
was advantageous in the past when only tables of test
statistics at common probability thresholds were available.
It allowed a decision to be made without the calculation
of a probability. It was adequate for classwork
and for operational use, but it was deficient for reporting
results.

The latter process relied on extensive tables or on computational
support not always available. The explicit calculation
of a probability is useful for reporting. The calculations
are now trivially performed with appropriate software.

The difference between the two processes, applied to the
radioactive suitcase example (below):

• “The Geiger-counter reading is 10. The limit is 9.
Check the suitcase.”

• “The Geiger-counter reading is high; 97% of safe
suitcases have lower readings. The limit is 95%.
Check the suitcase.”

The former report is adequate; the latter gives a more detailed
explanation of the data and the reason why the suitcase
is being checked.

It is important to note the difference between accepting
the null hypothesis and simply failing to reject it. The
“fail to reject” terminology highlights the fact that the null
hypothesis is assumed to be true from the start of the test;
if there is a lack of evidence against it, it simply continues
to be assumed true. The phrase “accept the null hypothesis”
may suggest it has been proved simply because
it has not been disproved, a logical fallacy known as the
argument from ignorance. Unless a test with particularly
high power is used, the idea of “accepting” the null hypothesis
may be dangerous. Nonetheless the terminology
is prevalent throughout statistics, where its meaning
is well understood.

The processes described here are perfectly adequate for
computation. They seriously neglect the design of experiments
considerations.[7][8]

It is particularly critical that appropriate sample sizes be
estimated before conducting the experiment.

The phrase “test of significance” was coined by statistician
Ronald Fisher.[9]


26.2.1 Interpretation 

If the p-value is less than the required significance level
(equivalently, if the observed test statistic is in the critical
region), then we say the null hypothesis is rejected at
the given level of significance. Rejection of the null hypothesis
is a conclusion. This is like a “guilty” verdict in
a criminal trial: the evidence is sufficient to reject innocence,
thus proving guilt. We might accept the alternative
hypothesis (and the research hypothesis).

If the p-value is not less than the required significance
level (equivalently, if the observed test statistic is outside
the critical region), then the test has no result. The evidence
is insufficient to support a conclusion. (This is like
a jury that fails to reach a verdict.) The researcher typically
gives extra consideration to those cases where the
p-value is close to the significance level.

In the Lady tasting tea example (below), Fisher required
the Lady to properly categorize all of the cups of tea to
justify the conclusion that the result was unlikely to result
from chance. He defined the critical region as that case
alone. The region was defined by a probability, computed
under the null hypothesis, of less than 5%.

Whether rejection of the null hypothesis truly justifies acceptance
of the research hypothesis depends on the structure
of the hypotheses. Rejecting the hypothesis that a
large paw print originated from a bear does not immediately
prove the existence of Bigfoot. Hypothesis testing
emphasizes the rejection, which is based on a probability,
rather than the acceptance, which requires extra steps of
logic.

“The probability of rejecting the null hypothesis is a
function of five factors: whether the test is one- or two-tailed,
the level of significance, the standard deviation,
the amount of deviation from the null hypothesis, and the
number of observations.”[10] These factors are a source
of criticism; factors under the control of the experimenter/analyst
give the results an appearance of subjectivity.

26.2.2 Use and importance 

Statistics are helpful in analyzing most collections of data.
This is equally true of hypothesis testing, which can justify
conclusions even when no scientific theory exists. In
the Lady tasting tea example, it was “obvious” that no difference
existed between (milk poured into tea) and (tea
poured into milk). The data contradicted the “obvious”.

Real world applications of hypothesis testing include:[11]

• Testing whether more men than women suffer from 
nightmares 

• Establishing authorship of documents 

• Evaluating the effect of the full moon on behavior 

• Determining the range at which a bat can detect an 
insect by echo 

• Deciding whether hospital carpeting results in more 
infections 

• Selecting the best means to stop smoking 

• Checking whether bumper stickers reflect car owner 
behavior 

• Testing the claims of handwriting analysts 

Statistical hypothesis testing plays an important role in the
whole of statistics and in statistical inference. For example,
Lehmann (1992) in a review of the fundamental paper
by Neyman and Pearson (1933) says: “Nevertheless,
despite their shortcomings, the new paradigm formulated 
in the 1933 paper, and the many developments carried 
out within its framework continue to play a central role 
in both the theory and practice of statistics and can be 
expected to do so in the foreseeable future”. 

Significance testing has been the favored statistical tool
in some experimental social sciences (over 90% of articles
in the Journal of Applied Psychology during the early
1990s).[12] Other fields have favored the estimation of parameters
(e.g., effect size). Significance testing is used as
a substitute for the traditional comparison of predicted
value and experimental result at the core of the scientific
method. When theory is only capable of predicting the
sign of a relationship, a directional (one-sided) hypothesis
test can be configured so that only a statistically significant
result supports theory. This form of theory appraisal
is the most heavily criticized application of hypothesis
testing.

26.2.3 Cautions 

“If the government required statistical procedures to
carry warning labels like those on drugs, most inference
methods would have long labels indeed.”[13] This caution
applies to hypothesis tests and alternatives to them.

The successful hypothesis test is associated with a probability
and a type-I error rate. The conclusion might be
wrong.

The conclusion of the test is only as solid as the sample
upon which it is based. The design of the experiment is
critical. A number of unexpected effects have been observed
including:

• The Clever Hans effect. A horse appeared to be capable
of doing simple arithmetic.

• The Hawthorne effect. Industrial workers were 
more productive in better illumination, and most 
productive in worse. 

• The Placebo effect. Pills with no medically active 
ingredients were remarkably effective. 

A statistical analysis of misleading data produces misleading
conclusions. The issue of data quality can be
more subtle. In forecasting, for example, there is no agreement
on a measure of forecast accuracy. In the absence
of a consensus measurement, no decision based on measurements
will be without controversy.

The book How to Lie with Statistics[14][15] is the most popular
book on statistics ever published.[16] It does not much
consider hypothesis testing, but its cautions are applicable,
including: Many claims are made on the basis of samples
too small to convince. If a report does not mention
sample size, be doubtful.

Hypothesis testing acts as a filter of statistical conclusions;
only those results meeting a probability threshold are publishable.
Economics also acts as a publication filter; only
those results favorable to the author and funding source
may be submitted for publication. The impact of filtering
on publication is termed publication bias. A related
problem is that of multiple testing (sometimes linked to
data mining), in which a variety of tests for a variety of
possible effects are applied to a single data set and only
those yielding a significant result are reported. These are
often dealt with by using multiplicity correction procedures
that control the familywise error rate (FWER) or
the false discovery rate (FDR).
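Two widely used multiplicity corrections can be sketched briefly. The text does not name specific procedures, so the choice here is an assumption: the Bonferroni correction is shown as an example of FWER control and the Benjamini-Hochberg step-up procedure as an example of FDR control. The p-values are made up for the illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls the FWER)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure: find the largest rank k such that the k-th
    smallest p-value is <= k * alpha / m, then reject the k smallest."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


# Hypothetical p-values from five tests applied to one data set.
p_values = [0.001, 0.012, 0.020, 0.035, 0.300]
fwer_decisions = bonferroni(p_values)
fdr_decisions = benjamini_hochberg(p_values)
print(fwer_decisions)
print(fdr_decisions)
```

On these values Bonferroni rejects only the first hypothesis, while Benjamini-Hochberg rejects the first four. FDR control is the less strict criterion of the two.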

Those making critical decisions based on the results of
a hypothesis test are prudent to look at the details rather
than the conclusion alone. In the physical sciences most
results are fully accepted only when independently confirmed.
The general advice concerning statistics is, “Figures
never lie, but liars figure” (anonymous).

26.3 Example 

26.3.1 Lady tasting tea 

Main article: Lady tasting tea 

In a famous example of hypothesis testing, known as the
Lady tasting tea,[17] a female colleague of Fisher claimed
to be able to tell whether the tea or the milk was added
first to a cup. Fisher proposed to give her eight cups, four
of each variety, in random order. One could then ask
what the probability was for her getting the number she
got correct, but just by chance. The null hypothesis was
that the Lady had no such ability. The test statistic was a
simple count of the number of successes in selecting the
4 cups. The critical region was the single case of 4 successes
of 4 possible based on a conventional probability
criterion (< 5%; 1 of 70 ≈ 1.4%). Fisher asserted that no
alternative hypothesis was (ever) required. The lady correctly
identified every cup,[18] which would be considered
a statistically significant result.
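The figure of 1 in 70 can be checked by counting. The sketch below assumes, as in the description above, that the lady must pick which 4 of the 8 cups had (say) milk added first, and that under the null hypothesis every choice of 4 cups is equally likely. The helper name `prob_correct` is invented for the example.

```python
from math import comb

# Number of ways to choose which 4 of the 8 cups are "milk first".
total_selections = comb(8, 4)


def prob_correct(k):
    """Chance of getting exactly k of the 4 selected cups right when
    guessing: choose k of the 4 true cups and 4 - k of the 4 others."""
    return comb(4, k) * comb(4, 4 - k) / total_selections


p_all_four = prob_correct(4)
p_three_or_more = prob_correct(3) + prob_correct(4)
print(total_selections, round(p_all_four, 4), round(p_three_or_more, 4))
```

The count also shows why the critical region contains only the perfect score. Allowing three or more successes would raise the chance of a false positive to 17/70, about 24%, far above the 5% criterion.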

26.3.2 Analogy - Courtroom trial 

A statistical test procedure is comparable to a criminal
trial; a defendant is considered not guilty as long as his
or her guilt is not proven. The prosecutor tries to prove
the guilt of the defendant. Only when there is enough
incriminating evidence is the defendant convicted.

At the start of the procedure, there are two hypotheses: H0:
“the defendant is not guilty”, and H1: “the defendant
is guilty”. The first one is called the null hypothesis, and is
for the time being accepted. The second one is called the
alternative (hypothesis). It is the hypothesis one hopes to
support.

The hypothesis of innocence is only rejected when an error
is very unlikely, because one doesn't want to convict
an innocent defendant. Such an error is called error of the 
first kind (i.e., the conviction of an innocent person), and 
the occurrence of this error is controlled to be rare. As a 
consequence of this asymmetric behaviour, the error of 
the second kind (acquitting a person who committed the 
crime), is often rather large. 

A criminal trial can be regarded as either or both of two
decision processes: guilty vs not guilty or evidence vs a
threshold (“beyond a reasonable doubt”). In one view, the 
defendant is judged; in the other view the performance 
of the prosecution (which bears the burden of proof) is 
judged. A hypothesis test can be regarded as either a 
judgment of a hypothesis or as a judgment of evidence. 

26.3.3 Example 1 - Philosopher’s beans 

The following example was produced by a philosopher
describing scientific methods generations before hypothesis
testing was formalized and popularized.[19]

Few beans of this handful are white.

Most beans in this bag are white.

Therefore: Probably, these beans were taken
from another bag.

This is an hypothetical inference.

The beans in the bag are the population. The handful are
the sample. The null hypothesis is that the sample originated
from the population. The criterion for rejecting the
null-hypothesis is the “obvious” difference in appearance
(an informal difference in the mean). The interesting result
is that consideration of a real population and a real
sample produced an imaginary bag. The philosopher was
considering logic rather than probability. To be a real
statistical hypothesis test, this example requires the formalities
of a probability calculation and a comparison of
that probability to a standard.

A simple generalization of the example considers a mixed
bag of beans and a handful that contain either very few
or very many white beans. The generalization considers
both extremes. It requires more calculations and more
comparisons to arrive at a formal answer, but the core philosophy
is unchanged: if the composition of the handful
is greatly different from that of the bag, then the sample
probably originated from another bag. The original example
is termed a one-sided or a one-tailed test while the
generalization is termed a two-sided or two-tailed test.

The statement also relies on the inference that the sampling
was random. If someone had been picking through
the bag to find white beans, then it would explain why the
handful had so many white beans, and also explain why
the number of white beans in the bag was depleted (although
the bag is probably intended to be assumed much
larger than one’s hand).

26.3.4 Example 2 - Clairvoyant card game [20]

A person (the subject) is tested for clairvoyance. He is 
shown the reverse of a randomly chosen playing card 25 
times and asked which of the four suits it belongs to. The 
number of hits, or correct answers, is called X. 
As we try to find evidence of his clairvoyance, for the
time being the null hypothesis is that the person is not
clairvoyant. The alternative is, of course: the person is
(more or less) clairvoyant.

If the null hypothesis is valid, the only thing the test person
can do is guess. For every card, the probability (relative
frequency) of any single suit appearing is 1/4. If the
alternative is valid, the test subject will predict the suit
correctly with probability greater than 1/4. We will call
the probability of guessing correctly p. The hypotheses,
then, are:

null hypothesis: $H_0: p = \tfrac{1}{4}$ (just guessing)

and

alternative hypothesis: $H_1: p > \tfrac{1}{4}$ (true clairvoyant).

When the test subject correctly predicts all 25 cards, we
will consider him clairvoyant, and reject the null hypothesis.
Thus also with 24 or 23 hits. With only 5 or 6 hits,
on the other hand, there is no cause to consider him so.
But what about 12 hits, or 17 hits? What is the critical
number, c, of hits, at which point we consider the subject
to be clairvoyant? How do we determine the critical
value c? It is obvious that with the choice c=25 (i.e. we
only accept clairvoyance when all cards are predicted correctly)
we're more critical than with c=10. In the first case
almost no test subjects will be recognized to be clairvoyant;
in the second case, a certain number will pass the test.
In practice, one decides how critical one will be. That is,
one decides how often one accepts an error of the first
kind - a false positive, or Type I error. With c = 25 the
probability of such an error is:

$$P(\text{reject } H_0 \mid H_0 \text{ is valid}) = P(X = 25 \mid p = \tfrac{1}{4}) = \left(\tfrac{1}{4}\right)^{25} \approx 10^{-15},$$

and hence, very small. The probability of a false positive
is the probability of randomly guessing correctly all 25
times.

Being less critical, with c=10, gives:

$$P(\text{reject } H_0 \mid H_0 \text{ is valid}) = P(X \ge 10 \mid p = \tfrac{1}{4}) = \sum_{k=10}^{25} P(X = k \mid p = \tfrac{1}{4}) \approx 0.07.$$

Thus, c = 10 yields a much greater probability of false
positive.

Before the test is actually performed, the maximum acceptable
probability of a Type I error (α) is determined.
Typically, values in the range of 1% to 5% are selected.
(If the maximum acceptable error rate is zero, an infinite
number of correct guesses is required.) Depending on
this Type I error rate, the critical value c is calculated.

For example, if we select an error rate of 1%, c is calculated
thus:

$$P(\text{reject } H_0 \mid H_0 \text{ is valid}) = P(X \ge c \mid p = \tfrac{1}{4}) \le 0.01.$$

From all the numbers c, with this property, we choose the
smallest, in order to minimize the probability of a Type II
error, a false negative. For the above example, we select:
c = 13.

26.3.5 Example 3 - Radioactive suitcase

As an example, consider determining whether a suitcase
contains some radioactive material. Placed under
a Geiger counter, it produces 10 counts per minute. The
null hypothesis is that no radioactive material is in the
suitcase and that all measured counts are due to ambient
radioactivity typical of the surrounding air and harmless
objects. We can then calculate how likely it is that we
would observe 10 counts per minute if the null hypothesis
were true. If the null hypothesis predicts (say) on average
9 counts per minute, then according to the Poisson distribution
typical for radioactive decay there is about a 41%
chance of recording 10 or more counts. Thus we can say
that the suitcase is compatible with the null hypothesis
(this does not guarantee that there is no radioactive material,
just that we don't have enough evidence to suggest
there is). On the other hand, if the null hypothesis predicts
3 counts per minute (for which the Poisson distribution
predicts only a 0.1% chance of recording 10 or more
counts) then the suitcase is not compatible with the null
hypothesis, and there are likely other factors responsible
for producing the measurements.

The test does not directly assert the presence of radioactive
material. A successful test asserts that the claim of
no radioactive material present is unlikely given the reading
(and therefore ...). The double negative (disproving
the null hypothesis) of the method is confusing, but using
a counter-example to disprove is standard mathematical
practice. The attraction of the method is its practicality.
We know (from experience) the expected range of
counts with only ambient radioactivity present, so we can
say that a measurement is unusually large. Statistics just
formalizes the intuitive by using numbers instead of adjectives.
We probably do not know the characteristics of
the radioactive suitcases; we just assume that they produce
larger readings.

To slightly formalize intuition: Radioactivity is suspected
if the Geiger-count with the suitcase is among or exceeds
the greatest (5% or 1%) of the Geiger-counts made
with ambient radiation alone. This makes no assumptions
about the distribution of counts. Many ambient radiation
observations are required to obtain good probability estimates
for rare events.
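The two Poisson probabilities quoted for the suitcase example (about 41% when the ambient mean is 9 counts per minute, about 0.1% when it is 3) can be reproduced with a few lines of code. This is a minimal sketch; the function name `poisson_tail` is chosen for the illustration.

```python
from math import exp, factorial


def poisson_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean), via the complement of P(X <= k - 1)."""
    return 1.0 - sum(exp(-mean) * mean**i / factorial(i) for i in range(k))


observed_counts = 10
p_if_ambient_mean_9 = poisson_tail(observed_counts, mean=9)
p_if_ambient_mean_3 = poisson_tail(observed_counts, mean=3)
print(round(p_if_ambient_mean_9, 3), round(p_if_ambient_mean_3, 4))
```

The same reading of 10 counts is unremarkable against an ambient mean of 9 and highly unusual against an ambient mean of 3. The outcome of the test therefore depends entirely on how well the null distribution is known.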

The test described here is more fully the null-hypothesis
statistical significance test. The null hypothesis represents
what we would believe by default, before seeing any
evidence. Statistical significance is a possible finding of 
the test, declared when the observed sample is unlikely 
to have occurred by chance if the null hypothesis were 
true. The name of the test describes its formulation and 
its possible outcome. One characteristic of the test is its 
crisp decision: to reject or not reject the null hypothesis. 
A calculated value is compared to a threshold, which is 
determined from the tolerable risk of error. 
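As a concrete case of a threshold determined from the tolerable risk of error, the critical value in the clairvoyant card game of Example 2 (25 cards, four suits) can be recomputed from the binomial distribution. The sketch below assumes that setup. The names `type_i_error` and `critical_value` are invented for the illustration.

```python
from math import comb

N_CARDS = 25    # number of cards shown to the subject
P_GUESS = 0.25  # chance of naming the right suit by pure guessing


def type_i_error(c):
    """P(X >= c) under H0, i.e. the false positive rate of the rule
    'declare the subject clairvoyant when there are at least c hits'."""
    return sum(
        comb(N_CARDS, k) * P_GUESS**k * (1 - P_GUESS) ** (N_CARDS - k)
        for k in range(c, N_CARDS + 1)
    )


def critical_value(alpha):
    """Smallest c whose false positive rate does not exceed alpha."""
    return next(c for c in range(N_CARDS + 2) if type_i_error(c) <= alpha)


c_for_one_percent = critical_value(0.01)
print(type_i_error(25), round(type_i_error(10), 4), c_for_one_percent)
```

Choosing the smallest admissible c keeps the Type I error within the stated limit while making the test as sensitive as that limit allows.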


26.4 Definition of terms 

The following definitions are mainly based on the exposition
in the book by Lehmann and Romano:[5]

Statistical hypothesis A statement about the parameters
describing a population (not a sample).

Statistic A value calculated from a sample, often to 
summarize the sample for comparison purposes. 

Simple hypothesis Any hypothesis which specifies the 
population distribution completely. 

Composite hypothesis Any hypothesis which does not 
specify the population distribution completely. 

Null hypothesis (H0) A simple hypothesis associated
with a contradiction to a theory one would like to
prove.

Alternative hypothesis (H1) A hypothesis (often composite)
associated with a theory one would like to
prove.

Statistical test A procedure whose inputs are samples 
and whose result is a hypothesis. 

Region of acceptance The set of values of the test
statistic for which we fail to reject the null hypothesis.

Region of rejection / Critical region The set of values
of the test statistic for which the null hypothesis is
rejected.

Critical value The threshold value delimiting the regions
of acceptance and rejection for the test statistic.

Power of a test (1 - β) The test’s probability of correctly
rejecting the null hypothesis. The complement
of the false negative rate, β. Power is termed
sensitivity in biostatistics. (“This is a sensitive test.
Because the result is negative, we can confidently say
that the patient does not have the condition.”) See
sensitivity and specificity and Type I and type II errors
for exhaustive definitions.


Size For simple hypotheses, this is the test’s probability
of incorrectly rejecting the null hypothesis. The
false positive rate. For composite hypotheses this
is the supremum of the probability of rejecting the
null hypothesis over all cases covered by the null hypothesis.
The complement of the false positive rate
is termed specificity in biostatistics. (“This is a specific
test. Because the result is positive, we can confidently
say that the patient has the condition.”) See
sensitivity and specificity and Type I and type II errors
for exhaustive definitions.

Significance level of a test (α) It is the upper bound
imposed on the size of a test. Its value is chosen by
the statistician prior to looking at the data or choosing
any particular test to be used. It is the maximum
exposure to erroneously rejecting H0 he/she is ready
to accept. Testing H0 at significance level α means
testing H0 with a test whose size does not exceed α.
In most cases, one uses tests whose size is equal to
the significance level.

p-value The probability, assuming the null hypothesis is 
true, of observing a result at least as extreme as the 
test statistic. 

Statistical significance test A predecessor to the statistical
hypothesis test (see the Origins section). An
experimental result was said to be statistically significant
if a sample was sufficiently inconsistent with
the (null) hypothesis. This was variously considered
common sense, a pragmatic heuristic for identifying
meaningful experimental results, a convention
establishing a threshold of statistical evidence or a
method for drawing conclusions from data. The statistical
hypothesis test added mathematical rigor and
philosophical consistency to the concept by making
the alternative hypothesis explicit. The term is
loosely used to describe the modern version which
is now part of statistical hypothesis testing.

Conservative test A test is conservative if, when constructed
for a given nominal significance level, the
true probability of incorrectly rejecting the null hypothesis
is never greater than the nominal level.

Exact test A test in which the significance level or critical
value can be computed exactly, i.e., without any
approximation. In some contexts this term is restricted
to tests applied to categorical data and to
permutation tests, in which computations are carried
out by complete enumeration of all possible outcomes
and their probabilities.

A statistical hypothesis test compares a test statistic (z or
t, for example) to a threshold. The test statistic (the
formula found in the table below) is based on optimality. For
a fixed Type I error rate, use of these statistics
minimizes Type II error rates (equivalent to maximizing
power). The following terms describe tests in terms of
such optimality:

Most powerful test For a given size or significance level,
the test with the greatest power (probability of
rejection) for a given value of the parameter(s) being
tested, contained in the alternative hypothesis.

Uniformly most powerful test (UMP) A test with the
greatest power for all values of the parameter(s) being
tested, contained in the alternative hypothesis.

26.5 Common test statistics 

Main article: Test statistic 

One-sample tests are appropriate when a sample is being
compared to the population from a hypothesis. The
population characteristics are known from theory or are
calculated from the population.

Two-sample tests are appropriate for comparing two 
samples, typically experimental and control samples from 
a scientifically controlled experiment. 

Paired tests are appropriate for comparing two samples
where it is impossible to control important variables.
Rather than comparing two sets, members are paired
between samples so the difference between the members
becomes the sample. Typically the mean of the
differences is then compared to zero. The common example
scenario for when a paired difference test is appropriate
is when a single set of test subjects has something applied
to them and the test is intended to check for an effect.
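
The paired procedure described above (form the differences, then compare their mean to zero) can be sketched as follows. The function name and the before/after measurements are hypothetical; the statistic shown is the usual paired t statistic, which would then be referred to a t-distribution with n - 1 degrees of freedom (that lookup is omitted here because it is not in the Python standard library).

```python
import math
import statistics

def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for the mean of paired differences vs zero."""
    if len(before) != len(after):
        raise ValueError("paired samples must have equal length")
    # The differences between paired members become the sample.
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)          # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))   # standardized mean difference
    return t, n - 1

# Hypothetical measurements on the same four subjects before and after a treatment.
t_stat, dof = paired_t_statistic([10, 12, 9, 11], [12, 14, 10, 13])
print(t_stat, dof)
```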

Z-tests are appropriate for comparing means under
stringent conditions regarding normality and a known
standard deviation.

A t-test is appropriate for comparing means under relaxed
conditions (less is assumed).

Tests of proportions are analogous to tests of means (the 
50% proportion). 

Chi-squared tests use the same calculations and the same 
probability distribution for different applications: 

• Chi-squared tests for variance are used to determine
whether a normal population has a specified
variance. The null hypothesis is that it does.

• Chi-squared tests of independence are used for
deciding whether two variables are associated or are
independent. The variables are categorical rather
than numeric. For example, such a test can be used to
decide whether left-handedness is correlated with
libertarian politics (or not). The null hypothesis is
that the variables are independent. The numbers used
in the calculation are the observed and expected
frequencies of occurrence (from contingency tables).

• Chi-squared goodness of fit tests are used to
determine the adequacy of curves fit to data. The
null hypothesis is that the curve fit is adequate. It
is common to determine curve shapes to minimize
the mean square error, so it is appropriate that the
goodness-of-fit calculation sums the squared errors.
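
The calculation behind the test of independence in the list above can be sketched in a few lines of Python. The function name and the contingency table are hypothetical; the sketch computes Pearson's chi-squared statistic from observed and expected frequencies and leaves the comparison with a chi-squared critical value to a statistical table or library.

```python
def chi_squared_independence(table):
    """Return (statistic, degrees_of_freedom) for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Hypothetical 2 x 2 table (rows: handedness, columns: political preference).
chi2, dof = chi_squared_independence([[10, 20], [20, 10]])
print(chi2, dof)
```

A table whose rows are exactly proportional, such as `[[10, 20], [20, 40]]`, gives a statistic of zero, which is what the null hypothesis of independence predicts on average apart from sampling noise.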

F-tests (analysis of variance, ANOVA) are commonly
used when deciding whether groupings of data by
category are meaningful. If the variance of test scores of the
left-handed in a class is much smaller than the variance
of the whole class, then it may be useful to study lefties
as a group. The null hypothesis is that two variances are
the same - so the proposed grouping is not meaningful.
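
One common form of this comparison is the one-way ANOVA F statistic, which divides the variance between group means by the variance within groups. The sketch below is an illustration under that assumption, with a hypothetical function name and made-up scores; it is not the only way an F-test can be set up.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical test scores for two groups of three students each.
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [4, 5, 6]])
print(f_stat, df_b, df_w)
```

A large F suggests that the grouping explains more variation than would be expected if the two variance estimates were the same.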

In the table below, the symbols used are defined at the
bottom of the table. Many other tests can be found
in other articles. Proofs exist that the test statistics are
appropriate.[21]

26.6 Origins and early controversy 

Significance testing is largely the product of Karl Pearson
(p-value, Pearson’s chi-squared test), William Sealy
Gosset (Student’s t-distribution), and Ronald Fisher ("null
hypothesis", analysis of variance, "significance test"), while
hypothesis testing was developed by Jerzy Neyman and
Egon Pearson (son of Karl). Ronald Fisher, a mathematician
and biologist described by Richard Dawkins as “the
greatest biologist since Darwin”, began his life in
statistics as a Bayesian (Zabell 1992), but Fisher soon grew
disenchanted with the subjectivity involved (namely use
of the principle of indifference when determining prior
probabilities), and sought to provide a more “objective”
approach to inductive inference.[27]

Fisher was an agricultural statistician who emphasized
rigorous experimental design and methods to extract a
result from few samples assuming Gaussian distributions.
Neyman (who teamed with the younger Pearson)
emphasized mathematical rigor and methods to obtain more
results from many samples and a wider range of
distributions. Modern hypothesis testing is an inconsistent
hybrid of the Fisher vs Neyman/Pearson formulation,
methods and terminology developed in the early 20th
century. While hypothesis testing was popularized early in
the 20th century, evidence of its use can be found much
earlier. In the 1770s Laplace considered the statistics of
almost half a million births. The statistics showed an
excess of boys compared to girls.[28] He concluded by
calculation of a p-value that the excess was a real, but
unexplained, effect.[29]

Fisher popularized the “significance test”. He required a
null hypothesis (corresponding to a population frequency
distribution) and a sample. His (now familiar)
calculations determined whether to reject the null hypothesis or
not. Significance testing did not utilize an alternative
hypothesis so there was no concept of a Type II error.

The p-value was devised as an informal, but objective,
index meant to help a researcher determine (based on
other knowledge) whether to modify future experiments
or strengthen one’s faith in the null hypothesis.[30]
Hypothesis testing (and Type I/II errors) was devised by
Neyman and Pearson as a more objective alternative to
Fisher’s p-value, also meant to determine researcher
behaviour, but without requiring any inductive inference by
the researcher.[31][32]

[Figure: “The Origin of Modern Hypothesis Testing: A Logically
Inconsistent Hybrid of Fisher's Inferential Significance
Testing and Neyman-Pearson Decision Theory”. Annotated pages
from Lindquist, E.F. (1940) Statistical Analysis In
Educational Research. Boston: Houghton Mifflin.
Page 12: Lindquist interprets the result of a statistical
test in terms of the falsity of experimental hypotheses
without incorporating it into Fisher’s logic of experimental
inference.
Page 15: Lindquist defines the null hypothesis as a “nil”
(zero difference) hypothesis. The use of non-nil hypotheses
was relegated to a footnote. This possibility is often
omitted from later textbooks.
Page 16: Inconsistent with his earlier interpretation of a
test result as the falsity of the hypothesis, Lindquist
advocates an interpretation in terms of Neyman-Pearson error
rates (long run frequency of making an incorrect decision).
For a more detailed account: Halpin, P F (Winter 2006).
"Inductive Inference or Inductive Behavior: Fisher and
Neyman: Pearson Approaches to Statistical Testing in
Psychological Research (1940-1960)". The American Journal of
Psychology 119 (4): 625-653.
Caption: A likely originator of the “hybrid” method of
hypothesis testing, as well as the use of “nil” null
hypotheses, is E.F. Lindquist in his statistics textbook:
Lindquist, E.F. (1940) Statistical Analysis In Educational
Research. Boston: Houghton Mifflin.]

Neyman & Pearson considered a different problem
(which they called “hypothesis testing”). They initially
considered two simple hypotheses (both with frequency
distributions). They calculated two probabilities and
typically selected the hypothesis associated with the higher
probability (the hypothesis more likely to have generated
the sample). Their method always selected a
hypothesis. It also allowed the calculation of both types of error
probabilities.

Fisher and Neyman/Pearson clashed bitterly.
Neyman/Pearson considered their formulation to be an
improved generalization of significance testing. (The
defining paper[31] was abstract. Mathematicians have
generalized and refined the theory for decades.[33]) Fisher
thought that it was not applicable to scientific research
because often, during the course of the experiment, it is
discovered that the initial assumptions about the null
hypothesis are questionable due to unexpected sources of
error. He believed that the use of rigid reject/accept
decisions based on models formulated before data is
collected was incompatible with this common scenario faced
by scientists and that attempts to apply this method to
scientific research would lead to mass confusion.[34]

The dispute between Fisher and Neyman-Pearson was
waged on philosophical grounds, characterized by a
philosopher as a dispute over the proper role of models
in statistical inference.[35]

Events intervened: Neyman accepted a position in the
western hemisphere, breaking his partnership with
Pearson and separating disputants (who had occupied the
same building) by much of the planetary diameter. World
War II provided an intermission in the debate. The
dispute between Fisher and Neyman terminated (unresolved
after 27 years) with Fisher’s death in 1962. Neyman
wrote a well-regarded eulogy.[36] Some of Neyman’s later
publications reported p-values and significance levels.[37]

The modern version of hypothesis testing is a hybrid of
the two approaches that resulted from confusion by
writers of statistical textbooks (as predicted by Fisher)
beginning in the 1940s.[38] (But signal detection, for
example, still uses the Neyman/Pearson formulation.) Great
conceptual differences and many caveats in addition to
those mentioned above were ignored. Neyman and
Pearson provided the stronger terminology, the more rigorous
mathematics and the more consistent philosophy, but the
subject taught today in introductory statistics has more
similarities with Fisher’s method than theirs.[39] This
history explains the inconsistent terminology (example: the
null hypothesis is never accepted, but there is a region of
acceptance).

Sometime around 1940,[38] in an apparent effort to
provide researchers with a “non-controversial”[40] way to
have their cake and eat it too, the authors of statistical
textbooks began anonymously combining these two
strategies by using the p-value in place of the test statistic (or
data) to test against the Neyman-Pearson “significance
level”.[38] Thus, researchers were encouraged to infer the
strength of their data against some null hypothesis using
p-values, while also thinking they were retaining the
post-data-collection objectivity provided by hypothesis
testing. It then became customary for the null hypothesis,
which was originally some realistic research hypothesis,
to be used almost solely as a strawman “nil” hypothesis
(one where a treatment has no effect, regardless of the
context).[41]

A comparison between Fisherian, frequentist 
(Neyman-Pearson) 


26.6.1 Early choices of null hypothesis 

Paul Meehl has argued that the epistemological
importance of the choice of null hypothesis has gone largely
unacknowledged. When the null hypothesis is predicted by
theory, a more precise experiment will be a more severe
test of the underlying theory. When the null hypothesis
defaults to “no difference” or “no effect”, a more precise
experiment is a less severe test of the theory that
motivated performing the experiment.[42] An examination of
the origins of the latter practice may therefore be useful:

1778: Pierre Laplace compares the birthrates of boys and
girls in multiple European cities. He states: “it is
natural to conclude that these possibilities are very nearly in
the same ratio”. Thus Laplace’s null hypothesis was that the
birthrates of boys and girls should be equal given
“conventional wisdom”.[28]

1900: Karl Pearson develops the chi-squared test to
determine “whether a given form of frequency curve will
effectively describe the samples drawn from a given
population.” Thus the null hypothesis is that a population is
described by some distribution predicted by theory. He
uses as an example the numbers of fives and sixes in the
Weldon dice throw data.[43]

1904: Karl Pearson develops the concept of
"contingency" in order to determine whether
outcomes are independent of a given categorical factor.
Here the null hypothesis is by default that two things
are unrelated (e.g. scar formation and death rates
from smallpox).[44] The null hypothesis in this case is no
longer predicted by theory or conventional wisdom, but is
instead the principle of indifference that led Fisher and
others to dismiss the use of “inverse probabilities”.[45]

26.7 Null hypothesis statistical significance testing vs hypothesis testing

An example of Neyman-Pearson hypothesis testing can 
be made by a change to the radioactive suitcase example. 
If the “suitcase” is actually a shielded container for the 
transportation of radioactive material, then a test might 
be used to select among three hypotheses: no radioactive 
source present, one present, two (all) present. The test 
could be required for safety, with actions required in each 
case. The Neyman-Pearson lemma of hypothesis testing 
says that a good criterion for the selection of hypotheses 
is the ratio of their probabilities (a likelihood ratio). A 
simple method of solution is to select the hypothesis with 
the highest probability for the Geiger counts observed. 
The typical result matches intuition: few counts imply no 
source, many counts imply two sources and intermediate 
counts imply one source. 
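
The selection rule in the example above can be sketched in Python. This is only an illustration under stated assumptions: Geiger counts are modelled as Poisson distributed, the expected counts for each hypothesis are invented numbers, and the function names are hypothetical. The sketch picks the hypothesis with the highest likelihood for the observed count and also exposes the likelihood ratio between any two hypotheses.

```python
import math

# Assumed expected counts per measurement interval (illustrative values only).
EXPECTED_COUNTS = {"no source": 2.0, "one source": 10.0, "two sources": 18.0}

def poisson_log_likelihood(count, rate):
    """Log of the Poisson probability of observing `count` given mean `rate`."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)

def select_hypothesis(count):
    """Pick the hypothesis under which the observed count is most probable."""
    return max(EXPECTED_COUNTS,
               key=lambda h: poisson_log_likelihood(count, EXPECTED_COUNTS[h]))

def likelihood_ratio(count, h1, h0):
    """Ratio of the probabilities of the observed count under h1 and h0."""
    return math.exp(poisson_log_likelihood(count, EXPECTED_COUNTS[h1])
                    - poisson_log_likelihood(count, EXPECTED_COUNTS[h0]))

for observed in (1, 9, 25):
    print(observed, "->", select_hypothesis(observed))
```

With these assumed rates the output follows the intuition stated above: few counts select no source, intermediate counts one source, and many counts two sources.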

Neyman-Pearson theory can accommodate both prior
probabilities and the costs of actions resulting from
decisions.[46] The former allows each test to consider the
results of earlier tests (unlike Fisher’s significance tests).
The latter allows the consideration of economic issues
(for example) as well as probabilities. A likelihood ratio
remains a good criterion for selecting among hypotheses.

The two forms of hypothesis testing are based on
different problem formulations. The original test is analogous
to a true/false question; the Neyman-Pearson test is more
like multiple choice. In the view of Tukey[47] the former
produces a conclusion on the basis of only strong evidence
while the latter produces a decision on the basis of
available evidence. While the two tests seem quite different
both mathematically and philosophically, later
developments lead to the opposite claim. Consider many tiny
radioactive sources. The hypotheses become 0, 1, 2, 3, ...
grains of radioactive sand. There is little distinction
between none or some radiation (Fisher) and 0 grains of
radioactive sand versus all of the alternatives
(Neyman-Pearson). The major Neyman-Pearson paper of 1933[31]
also considered composite hypotheses (ones whose
distribution includes an unknown parameter). An example
proved the optimality of the (Student’s) t-test, “there can
be no better test for the hypothesis under consideration”
(p 321). Neyman-Pearson theory was proving the
optimality of Fisherian methods from its inception.

Fisher’s significance testing has proven a popular
flexible statistical tool in application with little mathematical
growth potential. Neyman-Pearson hypothesis testing is
claimed as a pillar of mathematical statistics,[48] creating
a new paradigm for the field. It also stimulated new
applications in statistical process control, detection theory,
decision theory and game theory. Both formulations have
been successful, but the successes have been of a
different character.

The dispute over formulations is unresolved. Science
primarily uses Fisher’s (slightly modified) formulation
as taught in introductory statistics. Statisticians study
Neyman-Pearson theory in graduate school.
Mathematicians are proud of uniting the formulations.
Philosophers consider them separately. Learned opinions deem
the formulations variously competitive (Fisher vs
Neyman), incompatible[27] or complementary.[33] The
dispute has become more complex since Bayesian inference
has achieved respectability.

The terminology is inconsistent. Hypothesis testing can
mean any mixture of two formulations that both changed
with time. Any discussion of significance testing vs
hypothesis testing is doubly vulnerable to confusion.

Fisher thought that hypothesis testing was a useful
strategy for performing industrial quality control; however, he
strongly disagreed that hypothesis testing could be
useful for scientists.[30] Hypothesis testing provides a means
of finding test statistics used in significance testing.[33]
The concept of power is useful in explaining the
consequences of adjusting the significance level and is heavily
used in sample size determination. The two methods
remain philosophically distinct.[35] They usually (but not
always) produce the same mathematical answer. The
preferred answer is context dependent.[33] While the
existing merger of Fisher and Neyman-Pearson theories has
been heavily criticized, modifying the merger to achieve
Bayesian goals has been considered.[49]


26.8 Criticism 

See also: p-value § Criticisms 

Criticism of statistical hypothesis testing fills
volumes[50][51][52][53][54][55] citing 300-400 primary
references. Much of the criticism can be summarized by
the following issues:

• The interpretation of a p-value is dependent upon
stopping rule and definition of multiple comparison.
The former often changes during the course
of a study and the latter is unavoidably ambiguous
(i.e. “p values depend on both the (data) observed
and on the other possible (data) that might have been
observed but weren't”).[56]

• Confusion resulting (in part) from combining the
methods of Fisher and Neyman-Pearson which are
conceptually distinct.[47]

• Emphasis on statistical significance to the
exclusion of estimation and confirmation by repeated
experiments.[57]

• Rigidly requiring statistical significance as a
criterion for publication, resulting in publication bias.[58]


Most of the criticism is indirect. Rather than
being wrong, statistical hypothesis testing is
misunderstood, overused and misused.

• When used to detect whether a difference exists
between groups, a paradox arises. As improvements
are made to experimental design (e.g., increased
precision of measurement and sample size), the test
becomes more lenient. Unless one accepts the
absurd assumption that all sources of noise in the data
cancel out completely, the chance of finding
statistical significance in either direction approaches
100%.[59]

• Layers of philosophical concerns. The probability
of statistical significance is a function of decisions
made by experimenters/analysts.[10] If the decisions
are based on convention they are termed arbitrary or
mindless[40] while those not so based may be termed
subjective. To minimize type II errors, large
samples are recommended. In psychology practically
all null hypotheses are claimed to be false for
sufficiently large samples so “...it is usually nonsensical to
perform an experiment with the sole aim of
rejecting the null hypothesis.”[60] “Statistically significant
findings are often misleading” in psychology.[61]
Statistical significance does not imply practical
significance and correlation does not imply causation.
Casting doubt on the null hypothesis is thus far from
directly supporting the research hypothesis.

• “[I]t does not tell us what we want to know”.[62] Lists
of dozens of complaints are available.[54][63]
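
The paradox in the first item of the list above can be illustrated numerically. The sketch below is a simplified model, assuming a two-sided z-test with known standard deviation and a small but nonzero true difference; the function name and the effect size of 0.01 standard deviations are hypothetical. It computes the probability of a statistically significant result as the sample size grows.

```python
import math
from statistics import NormalDist

def two_sided_z_test_power(true_effect, sigma, n, alpha=0.05):
    """Probability that a two-sided z-test at level alpha rejects the null
    hypothesis of zero effect when the true mean is `true_effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic under the true effect.
    shift = true_effect * math.sqrt(n) / sigma
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)

# A tiny true effect becomes almost certain to reach significance as n grows.
for n in (100, 10_000, 1_000_000, 100_000_000):
    print(n, round(two_sided_z_test_power(0.01, 1.0, n), 4))
```

When the true effect is exactly zero the rejection probability stays at alpha for every sample size; any nonzero effect, however small, drives it toward 1.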

Critics and supporters are largely in factual agreement
regarding the characteristics of null hypothesis significance
testing (NHST): While it can provide critical information,
it is inadequate as the sole tool for statistical analysis.
Successfully rejecting the null hypothesis may offer no support
for the research hypothesis. The continuing controversy
concerns the selection of the best statistical practices for
the near-term future given the (often poor) existing
practices. Critics would prefer to ban NHST completely,
forcing a complete departure from those practices, while
supporters suggest a less absolute change.

Controversy over significance testing, and its effects
on publication bias in particular, has produced several
results. The American Psychological Association has
strengthened its statistical reporting requirements after
review,[64] medical journal publishers have recognized
the obligation to publish some results that are not
statistically significant to combat publication bias[65] and a
journal (Journal of Articles in Support of the Null Hypothesis)
has been created to publish such results exclusively.[66]
Textbooks have added some cautions[67] and increased
coverage of the tools necessary to estimate the size of the
sample required to produce significant results. Major
organizations have not abandoned use of significance tests
although some have discussed doing so.[64]

26.9 Alternatives

Main article: Estimation statistics

See also: Confidence interval § Statistical hypothesis testing

The numerous criticisms of significance testing do not
lead to a single alternative. A unifying position of critics
is that statistics should not lead to a conclusion or a
decision but to a probability or to an estimated value with
a confidence interval rather than to an accept-reject
decision regarding a particular hypothesis. It is unlikely
that the controversy surrounding significance testing will
be resolved in the near future. Its supposed flaws and
unpopularity do not eliminate the need for an objective
and transparent means of reaching conclusions
regarding studies that produce statistical results. Critics have
not unified around an alternative. Other forms of
reporting confidence or uncertainty could probably grow in
popularity. One strong critic of significance testing
suggested a list of reporting alternatives:[68] effect sizes for
importance, prediction intervals for confidence,
replications and extensions for replicability, meta-analyses for
generality. None of these suggested alternatives
produces a conclusion/decision. Lehmann said that
hypothesis testing theory can be presented in terms of
conclusions/decisions, probabilities, or confidence intervals.
“The distinction between the ... approaches is largely one
of reporting and interpretation.”[69]

On one “alternative” there is no disagreement: Fisher
himself said,[17] “In relation to the test of significance,
we may say that a phenomenon is experimentally
demonstrable when we know how to conduct an experiment
which will rarely fail to give us a statistically significant
result.” Cohen, an influential critic of significance
testing, concurred,[62] “... don't look for a magic alternative
to NHST [null hypothesis significance testing] ... It doesn't
exist.” “... given the problems of statistical induction, we
must finally rely, as have the older sciences, on
replication.” The “alternative” to significance testing is repeated
testing. The easiest way to decrease statistical uncertainty
is by obtaining more data, whether by increased sample
size or by repeated tests. Nickerson claimed to have never
seen the publication of a literally replicated experiment
in psychology.[63] An indirect approach to replication is
meta-analysis.

Bayesian inference is one proposed alternative to
significance testing. (Nickerson cited 10 sources suggesting it,
including Rozeboom (1960)).[63] For example, Bayesian
parameter estimation can provide rich information about
the data from which researchers can draw inferences,
while using uncertain priors that exert only minimal
influence on the results when enough data is available.
Psychologist John K. Kruschke has suggested Bayesian
estimation as an alternative for the t-test.[70] Alternatively,
two competing models/hypotheses can be compared
using Bayes factors.[71] Bayesian methods could be
criticized for requiring information that is seldom available in
the cases where significance testing is most heavily used.
Neither the prior probabilities nor the probability
distribution of the test statistic under the alternative hypothesis
are often available in the social sciences.[63]
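
A Bayes factor comparison of two hypotheses can be sketched for the simplest case, in which each hypothesis fixes a single value for the probability of heads of a coin. In that special case the Bayes factor is just the likelihood ratio; models with unknown parameters would require averaging the likelihood over a prior, which is not shown. The function names, the data and the prior of 0.5 are hypothetical choices for illustration.

```python
import math

def binomial_likelihood(heads, flips, p):
    """Probability of `heads` successes in `flips` trials with success rate p."""
    return math.comb(flips, heads) * p ** heads * (1 - p) ** (flips - heads)

def bayes_factor(heads, flips, p1, p0):
    """Evidence for H1 (rate p1) relative to H0 (rate p0) given the data."""
    return (binomial_likelihood(heads, flips, p1)
            / binomial_likelihood(heads, flips, p0))

def posterior_probability(prior_h1, bf):
    """Posterior probability of H1 from its prior probability and a Bayes factor."""
    prior_odds = prior_h1 / (1 - prior_h1)
    posterior_odds = bf * prior_odds
    return posterior_odds / (1 + posterior_odds)

# Hypothetical data: 8 heads in 10 flips; H0: p = 0.5, H1: p = 0.8.
bf = bayes_factor(8, 10, p1=0.8, p0=0.5)
posterior = posterior_probability(0.5, bf)
print(f"Bayes factor = {bf:.2f}, posterior P(H1) = {posterior:.3f}")
```

The need to supply `prior_h1` makes the criticism in the paragraph above concrete: the posterior cannot be computed without a prior probability.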

Advocates of a Bayesian approach sometimes claim that
the goal of a researcher is most often to objectively assess
the probability that a hypothesis is true based on the data
they have collected.[72][73] Neither Fisher's significance
testing nor Neyman-Pearson hypothesis testing can
provide this information, and neither claims to. The
probability a hypothesis is true can only be derived from use
of Bayes’ Theorem, which was unsatisfactory to both the
Fisher and Neyman-Pearson camps due to the explicit use
of subjectivity in the form of the prior probability.[31][74]
Fisher’s strategy is to sidestep this with the p-value (an
objective index based on the data alone) followed by
inductive inference, while Neyman-Pearson devised their
approach of inductive behaviour.

26.10 Philosophy 

Hypothesis testing and philosophy intersect. Inferential
statistics, which includes hypothesis testing, is applied
probability. Both probability and its application are
intertwined with philosophy. Philosopher David Hume wrote,
“All knowledge degenerates into probability.” Competing
practical definitions of probability reflect philosophical
differences. The most common application of
hypothesis testing is in the scientific interpretation of
experimental data, which is naturally studied by the philosophy of
science.

Fisher and Neyman opposed the subjectivity of
probability. Their views contributed to the objective definitions.
The core of their historical disagreement was
philosophical.

Many of the philosophical criticisms of hypothesis
testing are discussed by statisticians in other contexts,
particularly correlation does not imply causation and the design
of experiments. Hypothesis testing is of continuing
interest to philosophers.[35][75]

26.11 Education 

Main article: Statistics education 

Statistics is increasingly being taught in schools with
hypothesis testing being one of the elements taught.[76][77]
Many conclusions reported in the popular press
(political opinion polls to medical studies) are based on
statistics. An informed public should understand the
limitations of statistical conclusions[78][79] and many
college fields of study require a course in statistics for
the same reason.[78][79] An introductory college statistics
class places much emphasis on hypothesis testing -
perhaps half of the course. Such fields as literature and
divinity now include findings based on statistical analysis
(see the Bible Analyzer). An introductory statistics class
teaches hypothesis testing as a cookbook process.
Hypothesis testing is also taught at the postgraduate level.
Statisticians learn how to create good statistical test
procedures (like z, Student’s t, F and chi-squared).
Statistical hypothesis testing is considered a mature area within
statistics,[69] but a limited amount of development
continues.

The cookbook method of teaching introductory
statistics leaves no time for history, philosophy or controversy.
Hypothesis testing has been taught as a received unified
method. Surveys showed that graduates of the class were
filled with philosophical misconceptions (on all aspects of
statistical inference) that persisted among instructors.[80]
While the problem was addressed more than a decade
ago,[81] and calls for educational reform continue,[82]
students still graduate from statistics classes holding
fundamental misconceptions about hypothesis testing.[83] Ideas
for improving the teaching of hypothesis testing include
encouraging students to search for statistical errors in
published papers, teaching the history of statistics and
emphasizing the controversy in a generally dry subject.[84]


26.12 See also 

• Behrens-Fisher problem 

• Bootstrapping (statistics) 

• Checking if a coin is fair 

• Comparing means test decision tree 

• Complete spatial randomness 

• Counternull 

• Falsifiability 

• Fisher’s method for combining independent tests of 
significance 

• Granger causality 

• Look-elsewhere effect 

• Modifiable areal unit problem 

• Omnibus test 


Chapter 27

Bayesian inference 


Bayesian inference is a method of statistical inference in which
Bayes’ theorem is used to update the probability for a hypothesis as
evidence is acquired. Bayesian inference is an important technique in
statistics, and especially in mathematical statistics. Bayesian
updating is particularly important in the dynamic analysis of a
sequence of data. Bayesian inference has found application in a wide
range of activities, including science, engineering, philosophy,
medicine, and law. In the philosophy of decision theory, Bayesian
inference is closely related to subjective probability, often called
"Bayesian probability". Bayesian probability provides a rational
method for updating beliefs.


27.1 Introduction to Bayes’ rule 


Main article: Bayes’ rule 
See also: Bayesian probability 


27.1.1 Formal 

Bayesian inference derives the posterior probability as a
consequence of two antecedents, a prior probability and a
"likelihood function" derived from a statistical model for the
observed data. Bayesian inference computes the posterior probability
according to Bayes’ theorem:

$$P(H \mid E) = \frac{P(E \mid H) \cdot P(H)}{P(E)}$$

where 

• | denotes a conditional probability; more specifically, it means
given.

• H stands for any hypothesis whose probability may be affected by
data (called evidence below). Often there are competing hypotheses,
from which one chooses the most probable.

• the evidence E corresponds to new data that were not used in
computing the prior probability.

• P(H), the prior probability, is the probability of H before E is
observed. This indicates one’s previous estimate of the probability
that a hypothesis is true, before gaining the current evidence.

• P(H | E), the posterior probability, is the probability of H given
E, i.e., after E is observed. This tells us what we want to know: the
probability of a hypothesis given the observed evidence.

• P(E | H) is the probability of observing E given H. As a function
of H with E fixed, this is the likelihood. The likelihood function
should not be confused with P(H | E) as a function of H rather than
of E. It indicates the compatibility of the evidence with the given
hypothesis.

• P(E) is sometimes termed the marginal likelihood or “model
evidence”. This factor is the same for all possible hypotheses being
considered. (This can be seen by the fact that the hypothesis H does
not appear anywhere in the symbol, unlike for all the other factors.)
This means that this factor does not enter into determining the
relative probabilities of different hypotheses.

Note that, for different values of H, only the factors P(H) and
P(E | H) affect the value of P(H | E). As both of these factors
appear in the numerator, the posterior probability is proportional to
both. In words:

• (more precisely) The posterior probability of a hypothesis is
determined by a combination of the inherent likeliness of a
hypothesis (the prior) and the compatibility of the observed evidence
with the hypothesis (the likelihood).

• (more concisely) Posterior is proportional to likelihood times
prior.

Note that Bayes’ rule can also be written as follows:

$$P(H \mid E) = \frac{P(E \mid H)}{P(E)} \cdot P(H)$$

where the factor P(E | H) / P(E) represents the impact of E on the
probability of H.
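As a minimal sketch of the rule above, the following Python snippet
computes posterior probabilities for a small set of hypotheses that
are assumed to be mutually exclusive and exhaustive, so that P(E) can
be obtained by summing likelihood times prior. The function name
`posterior` and all numbers are hypothetical and chosen only for
illustration.

```python
def posterior(priors, likelihoods):
    """Return P(H_i | E) for each hypothesis via Bayes' theorem.

    Assumes the hypotheses are mutually exclusive and exhaustive.
    """
    # Unnormalised posterior: likelihood times prior for each hypothesis.
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # P(E) is the same for every hypothesis, so it only normalises.
    evidence = sum(joint)
    return [j / evidence for j in joint]


# Hypothetical numbers, not taken from the text.
priors = [0.7, 0.3]        # P(H_1), P(H_2)
likelihoods = [0.2, 0.9]   # P(E | H_1), P(E | H_2)
post = posterior(priors, likelihoods)
print(post)
```

Because P(E) is shared by all hypotheses, the ratio of two posteriors
equals the ratio of their likelihood-times-prior products, which is
the sense in which P(E) does not affect relative probabilities.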


27.1.2 Informal 

If the evidence does not match up with a hypothesis, one should
reject the hypothesis. But if a hypothesis is extremely unlikely a
priori, one should also reject it, even if the evidence does appear
to match up.

For example, imagine that I have various hypotheses about the nature
of a newborn baby of a friend, including:

• H_1 : the baby is a brown-haired boy.

• H_2 : the baby is a blond-haired girl.

• H_3 : the baby is a dog.

Then consider two scenarios:

1. I'm presented with evidence in the form of a picture of a
blond-haired baby girl. I find this evidence supports H_2 and opposes
H_1 and H_3.

2. I'm presented with evidence in the form of a picture of a baby
dog. Although this evidence, treated in isolation, supports H_3, my
prior belief in this hypothesis (that a human can give birth to a
dog) is extremely small, so the posterior probability is nevertheless
small.

The critical point about Bayesian inference, then, is that it
provides a principled way of combining new evidence with prior
beliefs, through the application of Bayes’ rule. (Contrast this with
frequentist inference, which relies only on the evidence as a whole,
with no reference to prior beliefs.) Furthermore, Bayes’ rule can be
applied iteratively: after observing some evidence, the resulting
posterior probability can then be treated as a prior probability, and
a new posterior probability computed from new evidence. This allows
for Bayesian principles to be applied to various kinds of evidence,
whether viewed all at once or over time. This procedure is termed
“Bayesian updating”.


27.1.3 Bayesian updating

Bayesian updating is widely used and computationally convenient.
However, it is not the only updating rule that might be considered
“rational”.

Ian Hacking noted that traditional "Dutch book" arguments did not
specify Bayesian updating: they left open the possibility that
non-Bayesian updating rules could avoid Dutch books. Hacking
wrote:[1] “And neither the Dutch book argument, nor any other in the
personalist arsenal of proofs of the probability axioms, entails the
dynamic assumption. Not one entails Bayesianism. So the personalist
requires the dynamic assumption to be Bayesian. It is true that in
consistency a personalist could abandon the Bayesian model of
learning from experience. Salt could lose its savour.”

Indeed, there are non-Bayesian updating rules that also avoid Dutch
books (as discussed in the literature on "probability kinematics")
following the publication of Richard C. Jeffrey's rule, which applies
Bayes’ rule to the case where the evidence itself is assigned a
probability.[2] The additional hypotheses needed to uniquely require
Bayesian updating have been deemed to be substantial, complicated,
and unsatisfactory.[3]


27.2 Formal description of Bayesian inference

27.2.1 Definitions


• x, a data point in general. This may in fact be a vector of values.

• θ, the parameter of the data point’s distribution, i.e.,
x ~ p(x | θ). This may in fact be a vector of parameters.

• α, the hyperparameter of the parameter, i.e., θ ~ p(θ | α). This
may in fact be a vector of hyperparameters.

• X, a set of n observed data points, i.e., x_1, …, x_n.

• x̃, a new data point whose distribution is to be predicted.

27.2.2 Bayesian inference 

• The prior distribution is the distribution of the parameter(s)
before any data is observed, i.e. p(θ | α).

• The prior distribution might not be easily determined. In this
case, one can use the Jeffreys prior to obtain a posterior
distribution, which is then updated with newer observations.

• The sampling distribution is the distribution of the observed data
conditional on its parameters, i.e. p(X | θ). This is also termed the
likelihood, especially when viewed as a function of the parameter(s),
sometimes written L(θ | X) = p(X | θ).

• The marginal likelihood (sometimes also termed the evidence) is the
distribution of the observed data marginalized over the parameter(s),
i.e.

$$p(\mathbf{X} \mid \alpha) = \int_\theta p(\mathbf{X} \mid \theta) \, p(\theta \mid \alpha) \, d\theta.$$

• The posterior distribution is the distribution of the parameter(s)
after taking into account the observed data. This is determined by
Bayes’ rule, which forms the heart of Bayesian inference:

$$p(\theta \mid \mathbf{X}, \alpha) = \frac{p(\mathbf{X} \mid \theta) \, p(\theta \mid \alpha)}{p(\mathbf{X} \mid \alpha)} \propto p(\mathbf{X} \mid \theta) \, p(\theta \mid \alpha)$$

Note that this is expressed in words as “posterior is proportional to
likelihood times prior”, or sometimes as “posterior = likelihood
times prior, over evidence”.

27.2.3 Bayesian prediction 

• The posterior predictive distribution is the distribution of a new
data point, marginalized over the posterior:

$$p(\tilde{x} \mid \mathbf{X}, \alpha) = \int_\theta p(\tilde{x} \mid \theta) \, p(\theta \mid \mathbf{X}, \alpha) \, d\theta$$

• The prior predictive distribution is the distribution of a new data
point, marginalized over the prior:

$$p(\tilde{x} \mid \alpha) = \int_\theta p(\tilde{x} \mid \theta) \, p(\theta \mid \alpha) \, d\theta$$

Bayesian theory calls for the use of the posterior predictive
distribution to do predictive inference, i.e., to predict the
distribution of a new, unobserved data point. That is, instead of a
fixed point as a prediction, a distribution over possible points is
returned. Only this way is the entire posterior distribution of the
parameter(s) used. By comparison, prediction in frequentist
statistics often involves finding an optimum point estimate of the
parameter(s) (e.g., by maximum likelihood or maximum a posteriori
estimation (MAP)) and then plugging this estimate into the formula
for the distribution of a data point. This has the disadvantage that
it does not account for any uncertainty in the value of the
parameter, and hence will underestimate the variance of the
predictive distribution.
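The variance gap described above can be illustrated with a small
sketch. It assumes, purely for tractability, normally distributed
data with a known variance sigma2 and a normal prior on the unknown
mean (a standard conjugate setup, simpler than the unknown-variance
case discussed in the text). The function name and all numbers are
hypothetical.

```python
def normal_posterior(data, sigma2, mu0, tau2_0):
    """Posterior mean and variance of a normal mean with known data variance.

    Prior: mean ~ Normal(mu0, tau2_0); data: x_i ~ Normal(mean, sigma2).
    """
    n = len(data)
    precision = 1.0 / tau2_0 + n / sigma2      # posterior precision
    post_var = 1.0 / precision
    post_mean = post_var * (mu0 / tau2_0 + sum(data) / sigma2)
    return post_mean, post_var


data = [4.8, 5.1, 5.3]     # hypothetical observations
sigma2 = 1.0               # assumed known data variance
post_mean, post_var = normal_posterior(data, sigma2, mu0=0.0, tau2_0=100.0)

# Plug-in prediction: treat the point estimate of the mean as exact.
plug_in_var = sigma2
# Posterior predictive: data noise plus remaining uncertainty in the mean.
predictive_var = sigma2 + post_var

print(plug_in_var, predictive_var)
```

In this setup the posterior predictive variance always exceeds the
plug-in variance by the posterior variance of the mean, and the gap
shrinks as more data are observed.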


(In some instances, frequentist statistics can work around this
problem. For example, confidence intervals and prediction intervals
in frequentist statistics when constructed from a normal distribution
with unknown mean and variance are constructed using a Student’s
t-distribution. This correctly estimates the variance, due to the
fact that (1) the average of normally distributed random variables is
also normally distributed; (2) the predictive distribution of a
normally distributed data point with unknown mean and variance, using
conjugate or uninformative priors, has a Student’s t-distribution. In
Bayesian statistics, however, the posterior predictive distribution
can always be determined exactly, or at least to an arbitrary level
of precision when numerical methods are used.)

Note that both types of predictive distributions have the form of a
compound probability distribution (as does the marginal likelihood).
In fact, if the prior distribution is a conjugate prior, and hence
the prior and posterior distributions come from the same family, it
can easily be seen that both prior and posterior predictive
distributions also come from the same family of compound
distributions. The only difference is that the posterior predictive
distribution uses the updated values of the hyperparameters (applying
the Bayesian update rules given in the conjugate prior article),
while the prior predictive distribution uses the values of the
hyperparameters that appear in the prior distribution.
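A minimal sketch of this point, assuming Bernoulli observations with
a Beta(a, b) conjugate prior on the success rate (a model chosen here
for illustration, not taken from the text): the same predictive
formula serves for both the prior and the posterior, and only the
hyperparameters change. The function names are hypothetical.

```python
def predictive_prob_success(a, b):
    """P(next observation = 1) under a Beta(a, b) belief about the rate."""
    return a / (a + b)


def update(a, b, observations):
    """Conjugate update: add successes to a and failures to b."""
    successes = sum(observations)
    failures = len(observations) - successes
    return a + successes, b + failures


a0, b0 = 2.0, 2.0                 # hypothetical prior hyperparameters
obs = [1, 1, 0, 1]                # hypothetical data: 3 successes, 1 failure
a1, b1 = update(a0, b0, obs)

prior_pred = predictive_prob_success(a0, b0)   # uses prior hyperparameters
post_pred = predictive_prob_success(a1, b1)    # uses updated hyperparameters
print(prior_pred, post_pred)
```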

27.3 Inference over exclusive and 
exhaustive possibilities 

If evidence is simultaneously used to update belief over a set of
exclusive and exhaustive propositions, Bayesian inference may be
thought of as acting on this belief distribution as a whole.


27.3.1 General formulation 

Suppose a process is generating independent and identically
distributed events E_n, but the probability distribution is unknown.
Let the event space Ω represent the current state of belief for this
process. Each model is represented by event M_m. The conditional
probabilities P(E_n | M_m) are specified to define the models.
P(M_m) is the degree of belief in M_m. Before the first inference
step, {P(M_m)} is a set of initial prior probabilities. These must
sum to 1, but are otherwise arbitrary.

Suppose that the process is observed to generate E ∈ {E_n}. For each
M ∈ {M_m}, the prior P(M) is updated to the posterior P(M | E). From
Bayes’ theorem:[4]

[Figure: Diagram illustrating event space in general formulation of
Bayesian inference. Although this diagram shows discrete models and
events, the continuous case may be visualized similarly using
probability densities.]


$$P(M \mid E) = \frac{P(E \mid M)}{\sum_m P(E \mid M_m) \, P(M_m)} \cdot P(M)$$

Upon observation of further evidence, this procedure may be
repeated.


27.3.2 Multiple observations 

For a set of independent and identically distributed observations
E = {e_1, …, e_n}, it may be shown that repeated application of the
above is equivalent to

$$P(M \mid \mathbf{E}) = \frac{P(\mathbf{E} \mid M)}{\sum_m P(\mathbf{E} \mid M_m) \, P(M_m)} \cdot P(M)$$

where

$$P(\mathbf{E} \mid M) = \prod_k P(e_k \mid M).$$

This may be used to optimize practical calculations.
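The equivalence between repeated updating and a single update with
the product likelihood can be checked numerically. The sketch below
assumes three hypothetical coin models, each defined by its
probability of heads, and a short hypothetical sequence of
observations; the helper names are illustrative only.

```python
from math import isclose, prod


def update(priors, likelihoods):
    """One Bayesian update over a discrete set of models."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


def lik(p_heads, e):
    """P(e | M) for a coin model with the given heads probability."""
    return p_heads if e == "H" else 1.0 - p_heads


models = [0.3, 0.5, 0.8]            # hypothetical P(heads | M_m)
priors = [1 / 3, 1 / 3, 1 / 3]
observations = ["H", "H", "T", "H"]

# Sequential: the posterior after each observation becomes the next prior.
sequential = priors
for e in observations:
    sequential = update(sequential, [lik(p, e) for p in models])

# Batch: a single update using the product of per-observation likelihoods.
batch = update(priors, [prod(lik(p, e) for e in observations) for p in models])

print(all(isclose(s, b) for s, b in zip(sequential, batch)))
```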


27.3.3 Parametric formulation 

By parameterizing the space of models, the belief in all models may
be updated in a single step. The distribution of belief over the
model space may then be thought of as a distribution of belief over
the parameter space. The distributions in this section are expressed
as continuous, represented by probability densities, as this is the
usual situation. The technique is however equally applicable to
discrete distributions.

Let the vector θ span the parameter space. Let the initial prior
distribution over θ be p(θ | α), where α is a set of parameters to
the prior itself, or hyperparameters. Let E = {e_1, …, e_n} be a set
of independent and identically distributed event observations, where
all e_i are distributed as p(e | θ) for some θ. Bayes’ theorem is
applied to find the posterior distribution over θ:

$$p(\theta \mid \mathbf{E}, \alpha) = \frac{p(\mathbf{E} \mid \theta, \alpha)}{p(\mathbf{E} \mid \alpha)} \cdot p(\theta \mid \alpha) = \frac{p(\mathbf{E} \mid \theta, \alpha)}{\int_\theta p(\mathbf{E} \mid \theta, \alpha) \, p(\theta \mid \alpha) \, d\theta} \cdot p(\theta \mid \alpha)$$

where

$$p(\mathbf{E} \mid \theta, \alpha) = \prod_k p(e_k \mid \theta)$$
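When the integral in the denominator has no convenient closed form,
the posterior density can be approximated on a grid. The following is
a rough sketch under stated assumptions: a single parameter θ in
[0, 1], Bernoulli observations, a uniform prior density, and a simple
Riemann sum in place of the integral. The function name and data are
hypothetical.

```python
def grid_posterior(observations, grid_size=1001):
    """Approximate p(theta | E) on an evenly spaced grid over [0, 1]."""
    step = 1.0 / (grid_size - 1)
    thetas = [i * step for i in range(grid_size)]
    prior = [1.0] * grid_size          # uniform prior density on [0, 1]

    def likelihood(theta):
        # Product over observations of p(e_k | theta).
        out = 1.0
        for e in observations:
            out *= theta if e == 1 else 1.0 - theta
        return out

    unnorm = [likelihood(t) * p for t, p in zip(thetas, prior)]
    evidence = sum(unnorm) * step      # Riemann-sum estimate of the integral
    return thetas, [u / evidence for u in unnorm]


thetas, density = grid_posterior([1, 1, 0, 1, 1, 0, 1])
mode = thetas[density.index(max(density))]
print(mode)
```

With a uniform prior the posterior mode coincides with the observed
success fraction, up to the grid resolution.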


27.4 Mathematical properties 

27.4.1 Interpretation of factor 

P(E | M) / P(E) > 1 ⇒ P(E | M) > P(E). That is, if the model were
true, the evidence would be more likely than is predicted by the
current state of belief. The reverse applies for a decrease in
belief. If the belief does not change, P(E | M) / P(E) = 1 ⇒
P(E | M) = P(E). That is, the evidence is independent of the model.
If the model were true, the evidence would be exactly as likely as
predicted by the current state of belief.


27.4.2 Cromwell’s rule 

Main article: Cromwell’s rule 

If P(M) = 0 then P(M | E) = 0. If P(M) = 1, then P(M | E) = 1. This
can be interpreted to mean that hard convictions are insensitive to
counter-evidence.

The former follows directly from Bayes’ theorem. The latter can be
derived by applying the first rule to the event “not M” in place of
“M”, yielding “if 1 − P(M) = 0, then 1 − P(M | E) = 0”, from which
the result immediately follows.
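A short numerical illustration of the rule, using a hypothetical
discrete update function and made-up likelihoods: a model with prior
probability 0 stays at 0 however strongly the evidence favours it,
and a model with prior probability 1 stays at 1.

```python
def update(priors, likelihoods):
    """One Bayesian update over a discrete set of models."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


# The first model is ruled out a priori, although the evidence fits it best.
post = update([0.0, 0.4, 0.6], [0.99, 0.01, 0.01])

# The first model is held with certainty, although the evidence barely fits.
certain = update([1.0, 0.0], [1e-9, 0.999])

print(post[0], certain[0])
```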


27.4.3 Asymptotic behaviour of posterior 

Consider the behaviour of a belief distribution as it is updated a
large number of times with independent and identically distributed
trials. For sufficiently nice prior probabilities, the Bernstein-von
Mises theorem gives that in the limit of infinite trials, the
posterior converges to a Gaussian distribution independent of the
initial prior under some conditions first outlined and rigorously
proven by Joseph L. Doob in 1948, namely if the random variable in
consideration has a finite probability space. More general results
were obtained later by the statistician David A. Freedman, who
established in two seminal research papers in 1963 and 1965 when and
under what circumstances the asymptotic behaviour of the posterior is
guaranteed. His 1963 paper treats, like Doob (1949), the finite case
and comes to a satisfactory conclusion. However, if the random
variable has an infinite but countable probability space (i.e.,
corresponding to a die with infinitely many faces), the 1965 paper
demonstrates that for a dense subset of priors the Bernstein-von
Mises theorem is not applicable. In this case there is almost surely
no asymptotic convergence. Later in the 1980s and 1990s Freedman and
Persi Diaconis continued to work on the case of infinite countable
probability spaces.[5] To summarise, there may be insufficient trials
to suppress the effects of the initial choice, and especially for
large (but finite) systems the convergence might be very slow.


27.4.4 Conjugate priors

Main article: Conjugate prior

In parameterized form, the prior distribution is often assumed to
come from a family of distributions called conjugate priors. The
usefulness of a conjugate prior is that the corresponding posterior
distribution will be in the same family, and the calculation may be
expressed in closed form.

27.4.5 Estimates of parameters and predictions

It is often desired to use a posterior distribution to estimate a
parameter or variable. Several methods of Bayesian estimation select
measurements of central tendency from the posterior distribution.

For one-dimensional problems, a unique median exists for practical
continuous problems. The posterior median is attractive as a robust
estimator.[6]

If there exists a finite mean for the posterior distribution, then
the posterior mean is a method of estimation.

$$\tilde{\theta} = E[\theta] = \int_\theta \theta \, p(\theta \mid \mathbf{X}, \alpha) \, d\theta$$

Taking a value with the greatest probability defines maximum a
posteriori (MAP) estimates:

$$\{\theta_{\text{MAP}}\} \subset \arg\max_\theta p(\theta \mid \mathbf{X}, \alpha).$$

There are examples where no maximum is attained, in which case the
set of MAP estimates is empty.

There are other methods of estimation that minimize the posterior
risk (expected-posterior loss) with respect to a loss function, and
these are of interest to statistical decision theory using the
sampling distribution (“frequentist statistics”).

The posterior predictive distribution of a new observation x̃ (that
is independent of previous observations) is determined by

$$p(\tilde{x} \mid \mathbf{X}, \alpha) = \int_\theta p(\tilde{x}, \theta \mid \mathbf{X}, \alpha) \, d\theta = \int_\theta p(\tilde{x} \mid \theta) \, p(\theta \mid \mathbf{X}, \alpha) \, d\theta.$$


27.5 Examples

27.5.1 Probability of a hypothesis

Suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate
chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend
Fred picks a bowl at random, and then picks a cookie at random. We
may assume there is no reason to believe Fred treats one bowl
differently from another, likewise for the cookies. The cookie turns
out to be a plain one. How probable is it that Fred picked it out of
bowl #1?

Intuitively, it seems clear that the answer should be more than a
half, since there are more plain cookies in bowl #1. The precise
answer is given by Bayes’ theorem. Let H_1 correspond to bowl #1, and
H_2 to bowl #2. It is given that the bowls are identical from Fred’s
point of view, thus P(H_1) = P(H_2), and the two must add up to 1, so
both are equal to 0.5. The event E is the observation of a plain
cookie. From the contents of the bowls, we know that
P(E | H_1) = 30/40 = 0.75 and P(E | H_2) = 20/40 = 0.5. Bayes’
formula then yields

$$P(H_1 \mid E) = \frac{P(E \mid H_1) \, P(H_1)}{P(E \mid H_1) \, P(H_1) + P(E \mid H_2) \, P(H_2)} = \frac{0.75 \times 0.5}{0.75 \times 0.5 + 0.5 \times 0.5} = 0.6$$

Before we observed the cookie, the probability we assigned for Fred
having chosen bowl #1 was the prior probability, P(H_1), which was
0.5. After observing the cookie, we must revise the probability to
P(H_1 | E), which is 0.6.
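The cookie calculation can be reproduced exactly with rational
arithmetic. This is a small sketch using Python's standard
`fractions` module; the dictionary keys are illustrative names for
the two bowls.

```python
from fractions import Fraction

# Prior: Fred is equally likely to pick either bowl.
prior = {"bowl1": Fraction(1, 2), "bowl2": Fraction(1, 2)}
# Likelihood of drawing a plain cookie from each bowl.
plain = {"bowl1": Fraction(30, 40), "bowl2": Fraction(20, 40)}

# P(E): total probability of drawing a plain cookie.
evidence = sum(prior[b] * plain[b] for b in prior)
# Posterior probability of each bowl given a plain cookie.
posterior = {b: prior[b] * plain[b] / evidence for b in prior}

print(posterior["bowl1"])   # 3/5
```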
27.5.2 Making a prediction

An archaeologist is working at a site thought to be from the medieval
period, between the 11th century and the 16th century. However, it is
uncertain exactly when in this period the site was inhabited.
Fragments of pottery are found, some of which are glazed and some of
which are decorated. It is expected that if the site were inhabited
during the early medieval period, then 1% of the pottery would be
glazed and 50% of its area decorated, whereas if it had been
inhabited in the late medieval period then 81% would be glazed and 5%
of its area decorated. How confident can the archaeologist be in the
date of inhabitation as fragments are unearthed?

The degree of belief in the continuous variable C (century) is to be
calculated, with the discrete set of events
$\{GD, G\bar{D}, \bar{G}D, \bar{G}\bar{D}\}$ as evidence. Assume
linear variation of glaze and decoration with time, and that these
variables are independent. Then:

$$P(E = GD \mid C = c) = (0.01 + 0.16(c - 11))(0.5 - 0.09(c - 11))$$

$$P(E = G\bar{D} \mid C = c) = (0.01 + 0.16(c - 11))(0.5 + 0.09(c - 11))$$

$$P(E = \bar{G}D \mid C = c) = (0.99 - 0.16(c - 11))(0.5 - 0.09(c - 11))$$

$$P(E = \bar{G}\bar{D} \mid C = c) = (0.99 - 0.16(c - 11))(0.5 + 0.09(c - 11))$$

Assume a uniform prior of f_C(c) = 0.2, and that trials are
independent and identically distributed. When a new fragment of type
e is discovered, Bayes’ theorem is applied to update the degree of
belief for each c:

$$f_C(c \mid E = e) = \frac{P(E = e \mid C = c)}{P(E = e)} f_C(c) = \frac{P(E = e \mid C = c)}{\int_{11}^{16} P(E = e \mid C = c) \, f_C(c) \, dc} f_C(c)$$

A computer simulation of the changing belief as 50 fragments are
unearthed is shown on the graph. In the simulation, the site was
inhabited around 1420, or c = 15.2. By calculating the area under the
relevant portion of the graph for 50 trials, the archaeologist can
say that there is practically no chance the site was inhabited in the
11th and 12th centuries, about 1% chance that it was inhabited during
the 13th century, 63% chance during the 14th century and 36% during
the 15th century. Note that the Bernstein-von Mises theorem asserts
here the asymptotic convergence to the “true” distribution because
the probability space corresponding to the discrete set of events
$\{GD, G\bar{D}, \bar{G}D, \bar{G}\bar{D}\}$ is finite (see above
section on asymptotic behaviour of the posterior).

[Figure: Example results for archaeology example (horizontal axis:
century). This simulation was generated using c = 15.2.]

27.6 In frequentist statistics and decision theory

A decision-theoretic justification of the use of Bayesian inference
was given by Abraham Wald, who proved that every unique Bayesian
procedure is admissible. Conversely, every admissible statistical
procedure is either a Bayesian procedure or a limit of Bayesian
procedures.[7]

Wald characterized admissible procedures as Bayesian procedures (and
limits of Bayesian procedures), making the Bayesian formalism a
central technique in such areas of frequentist inference as parameter
estimation, hypothesis testing, and computing confidence
intervals.[8] For example:

• “Under some conditions, all admissible procedures are either Bayes
procedures or limits of Bayes procedures (in various senses). These
remarkable results, at least in their original form, are due
essentially to Wald. They are useful because the property of being
Bayes is easier to analyze than admissibility.”[7]

• “In decision theory, a quite general method for proving
admissibility consists in exhibiting a procedure as a unique Bayes
solution.”[9]

• “In the first chapters of this work, prior distributions with
finite support and the corresponding Bayes procedures were used to
establish some of the main theorems relating to the comparison of
experiments. Bayes procedures with respect to more general prior
distributions have played a very important role in the development of
statistics, including its asymptotic theory.” “There are many
problems where a glance at posterior distributions, for suitable
priors, yields immediately interesting information. Also, this
technique can hardly be avoided in sequential analysis.”[10]

• “A useful fact is that any Bayes decision rule obtained by taking a
proper prior over the whole parameter space must be admissible”[11]

• “An important area of investigation in the development of
admissibility ideas has been that of conventional sampling-theory
procedures, and many interesting results have been obtained.”[12]
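Returning to the archaeology example of section 27.5.2 above, the
updating procedure can be sketched as a small simulation. This is a
rough illustration rather than the simulation behind the published
graph: the grid resolution, the random seed and the function names
are assumptions, and the integral over c is replaced by a midpoint
Riemann sum, so the resulting percentages will generally differ from
those quoted in the text.

```python
import random


def likelihood(glazed, decorated, c):
    """P(E = e | C = c) under the linear, independent model of 27.5.2."""
    p_g = 0.01 + 0.16 * (c - 11)     # probability a fragment is glazed
    p_d = 0.5 - 0.09 * (c - 11)      # probability a fragment is decorated
    return (p_g if glazed else 1 - p_g) * (p_d if decorated else 1 - p_d)


def simulate(true_c=15.2, n_fragments=50, cells=500, seed=0):
    """Return the posterior probability mass in each interval [k, k + 1)."""
    rng = random.Random(seed)
    h = 5.0 / cells                                   # cell width on [11, 16]
    grid = [11 + (i + 0.5) * h for i in range(cells)]
    density = [0.2] * cells                           # uniform prior f_C(c)
    p_g = 0.01 + 0.16 * (true_c - 11)
    p_d = 0.5 - 0.09 * (true_c - 11)
    for _ in range(n_fragments):
        # Draw one fragment from the "true" century.
        glazed = rng.random() < p_g
        decorated = rng.random() < p_d
        # Bayes' theorem on the grid: likelihood times current density.
        unnorm = [likelihood(glazed, decorated, c) * f
                  for c, f in zip(grid, density)]
        evidence = sum(unnorm) * h
        density = [u / evidence for u in unnorm]
    mass = {k: 0.0 for k in range(11, 16)}
    for c, f in zip(grid, density):
        mass[int(c)] += f * h
    return mass


mass = simulate()
for k in sorted(mass):
    print(k, round(mass[k], 3))
```

Each printed value approximates the area under the posterior density
over one unit interval of c, which is the quantity the archaeologist
reads off the graph.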
27.6.1 Model selection

See Bayesian model selection


27.7 Applications

27.7.1 Computer applications

Bayesian inference has applications in artificial intelligence and
expert systems. Bayesian inference techniques have been a fundamental
part of computerized pattern recognition techniques since the late
1950s. There is also an ever growing connection between Bayesian
methods and simulation-based Monte Carlo techniques since complex
models cannot be processed in closed form by a Bayesian analysis,
while a graphical model structure may allow for efficient simulation
algorithms like the Gibbs sampling and other Metropolis-Hastings
algorithm schemes.[13] Recently Bayesian inference has gained
popularity amongst the phylogenetics community for these reasons; a
number of applications allow many demographic and evolutionary
parameters to be estimated simultaneously.

As applied to statistical classification, Bayesian inference has been
used in recent years to develop algorithms for identifying e-mail
spam. Applications which make use of Bayesian inference for spam
filtering include CRM114, DSPAM, Bogofilter, SpamAssassin, SpamBayes,
Mozilla, XEAMS, and others. Spam classification is treated in more
detail in the article on the naive Bayes classifier.

Solomonoff’s Inductive inference is the theory of prediction based on
observations; for example, predicting the next symbol based upon a
given series of symbols. The only assumption is that the environment
follows some unknown but computable probability distribution. It is a
formal inductive framework that combines two well-studied principles
of inductive inference: Bayesian statistics and Occam’s Razor.[14]
Solomonoff’s universal prior probability of any prefix p of a
computable sequence x is the sum of the probabilities of all programs
(for a universal computer) that compute something starting with p.
Given some p and any computable but unknown probability distribution
from which x is sampled, the universal prior and Bayes’ theorem can
be used to predict the yet unseen parts of x in optimal
fashion.[15][16]

27.7.2 In the courtroom

Bayesian inference can be used by jurors to coherently accumulate the
evidence for and against a defendant, and to see whether, in
totality, it meets their personal threshold for 'beyond a reasonable
doubt'.[17][18][19] Bayes’ theorem is applied successively to all
evidence presented, with the posterior from one stage becoming the
prior for the next. The benefit of a Bayesian approach is that it
gives the juror an unbiased, rational mechanism for combining
evidence. It may be appropriate to explain Bayes’ theorem to jurors
in odds form, as betting odds are more widely understood than
probabilities. Alternatively, a logarithmic approach, replacing
multiplication with addition, might be easier for a jury to handle.

[Figure: Adding up evidence.]


If the existence of the crime is not in doubt, only the identity of
the culprit, it has been suggested that the prior should be uniform
over the qualifying population.[20] For example, if 1,000 people
could have committed the crime, the prior probability of guilt would
be 1/1000.
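The odds form and the logarithmic approach mentioned above can be
sketched as follows. The sketch starts from the 1/1000 prior and adds
the base-10 logarithm of a likelihood ratio for each piece of
evidence, where a likelihood ratio is taken to mean
P(evidence | guilty) / P(evidence | innocent). The three ratios are
hypothetical, and the items of evidence are assumed to be
conditionally independent; the function name is illustrative only.

```python
from math import log10


def posterior_probability(prior_prob, likelihood_ratios):
    """Accumulate evidence additively in log-odds, then convert back."""
    log_odds = log10(prior_prob / (1 - prior_prob))   # prior log-odds
    for lr in likelihood_ratios:
        log_odds += log10(lr)        # multiplication becomes addition
    odds = 10 ** log_odds
    return odds / (1 + odds)


# Hypothetical evidence: two incriminating items and one mildly exculpatory.
p = posterior_probability(1 / 1000, [100.0, 20.0, 0.5])
print(round(p, 4))
```

Working in log-odds, each new piece of evidence simply adds to or
subtracts from a running total, which is the sense in which
multiplication is replaced with addition.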

The use of Bayes’ theorem by jurors is controversial. In the United
Kingdom, a defence expert witness explained Bayes’ theorem to the
jury in R v Adams. The jury convicted, but the case went to appeal on
the basis that no means of accumulating evidence had been provided
for jurors who did not wish to use Bayes’ theorem. The Court of
Appeal upheld the conviction, but it also gave the opinion that “To
introduce Bayes’ Theorem, or any similar method, into a criminal
trial plunges the jury into inappropriate and unnecessary realms of
theory and complexity, deflecting them from their proper task.”

Gardner-Medwin[21] argues that the criterion on which a verdict in a
criminal trial should be based is not the probability of guilt, but
rather the probability of the evidence, given that the defendant is
innocent (akin to a frequentist p-value). He argues that if the
posterior probability of guilt is to be computed by Bayes’ theorem,
the prior probability of guilt must be known. This will depend on the
incidence of the crime, which is an unusual piece of evidence to
consider in a criminal trial. Consider the following three
propositions:

A The known facts and testimony could have arisen if the defendant is
guilty

B The known facts and testimony could have arisen if the defendant is
innocent

C The defendant is guilty.

Gardner-Medwin argues that the jury should believe both 
A and not-B in order to convict. A and not-B implies the 
truth of C, but the reverse is not true. It is possible that 
B and C are both true, but in this case he argues that a 
jury should acquit, even though they know that they will 
be letting some guilty people go free. See also Lindley’s 
paradox. 

27.7.3 Bayesian epistemology 

Bayesian epistemology is a movement that advocates for Bayesian
inference as a means of justifying the rules of inductive logic.

Karl Popper and David Miller have rejected the alleged rationality of
Bayesianism, i.e. using Bayes’ rule to make epistemological
inferences:[22] It is prone to the same vicious circle as any other
justificationist epistemology, because it presupposes what it
attempts to justify. According to this view, a rational
interpretation of Bayesian inference would see it merely as a
probabilistic version of falsification, rejecting the belief,
commonly held by Bayesians, that high likelihood achieved by a series
of Bayesian updates would prove the hypothesis beyond any reasonable
doubt, or even with likelihood greater than 0.

27.7.4 Other 

• The scientific method is sometimes interpreted as an application of
Bayesian inference. In this view, Bayes’ rule guides (or should
guide) the updating of probabilities about hypotheses conditional on
new observations or experiments.[23]

• Bayesian search theory is used to search for lost objects.

• Bayesian inference in phylogeny 

• Bayesian tool for methylation analysis 


27.8 Bayes and Bayesian inference 

The problem considered by Bayes in Proposition 9 of his essay, "An
Essay towards solving a Problem in the Doctrine of Chances", is the
posterior distribution for the parameter a (the success rate) of the
binomial distribution.


27.9 History 

Main article: History of statistics § Bayesian statistics 

The term Bayesian refers to Thomas Bayes (1702-1761), 
who proved a special case of what is now called Bayes’ 
theorem. However, it was Pierre-Simon Laplace (1749— 
1827) who introduced a general version of the theorem 
and used it to approach problems in celestial mechanics, 
medical statistics, reliability, and jurisprudence. 1241 Early 
Bayesian inference, which used uniform priors follow¬ 
ing Laplace’s principle of insufficient reason, was called 
"inverse probability" (because it infers backwards from 
observations to parameters, or from effects to causes 1 1251 ). 
After the 1920s, “inverse probability” was largely sup¬ 
planted by a collection of methods that came to be called 
frequentist statistics. 1251 

In the 20th century, the ideas of Laplace were further
developed in two different directions, giving rise to objective
and subjective currents in Bayesian practice. In
the objective or “non-informative” current, the statistical
analysis depends on only the model assumed, the data
analyzed,[26] and the method assigning the prior, which
differs from one objective Bayesian to another objective
Bayesian. In the subjective or “informative” current, the
specification of the prior depends on the belief (that is,
propositions on which the analysis is prepared to act),
which can summarize information from experts, previous
studies, etc.

In the 1980s, there was a dramatic growth in research
and applications of Bayesian methods, mostly attributed
to the discovery of Markov chain Monte Carlo methods,
which removed many of the computational problems,
and an increasing interest in nonstandard, complex
applications.[27] Despite growth of Bayesian research,
most undergraduate teaching is still based on frequentist
statistics.[28] Nonetheless, Bayesian methods are widely
accepted and used, for example in the field of
machine learning.[29]


27.10 See also 

• Bayes’ theorem 

• Bayesian hierarchical modeling 

• Bayesian Analysis, the journal of the ISBA 

• Inductive probability 

• International Society for Bayesian Analysis (ISBA) 

• Jeffreys prior 
Chapter 28

Chi-squared distribution 


This article is about the mathematics of the chi-squared 
distribution. For its uses in statistics, see chi-squared 
test. For the music group, see Chi2 (band). 

In probability theory and statistics, the chi-squared
distribution (also chi-square or χ²-distribution) with k
degrees of freedom is the distribution of a sum of the
squares of k independent standard normal random variables.
It is a special case of the gamma distribution and
is one of the most widely used probability distributions in
inferential statistics, e.g., in hypothesis testing or in
construction of confidence intervals.[2][3][4][5] When it is
being distinguished from the more general noncentral chi-
squared distribution, this distribution is sometimes called
the central chi-squared distribution.

The chi-squared distribution is used in the common chi-
squared tests for goodness of fit of an observed distribution
to a theoretical one, the independence of two criteria
of classification of qualitative data, and in confidence
interval estimation for a population standard deviation of
a normal distribution from a sample standard deviation.
Many other statistical tests also use this distribution, like
Friedman’s analysis of variance by ranks.


28.1 History and name

This distribution was first described by the German
statistician Friedrich Robert Helmert in papers of 1875-6,[6][7]
where he computed the sampling distribution of the sample
variance of a normal population. Thus in German this
was traditionally known as the Helmert’sche (“Helmertian”)
or “Helmert distribution”.

The distribution was independently rediscovered by the
English mathematician Karl Pearson in the context of
goodness of fit, for which he developed his Pearson’s chi-
squared test, published in 1900, with computed table of
values published in (Elderton 1902), collected in (Pearson
1914, pp. xxxi-xxxiii, 26-28, Table XII). The name
“chi-squared” ultimately derives from Pearson’s shorthand
for the exponent in a multivariate normal distribution
with the Greek letter Chi, writing $-\tfrac{1}{2}\chi^2$ for what
would appear in modern notation as $-\tfrac{1}{2} x^T \Sigma^{-1} x$
($\Sigma$ being the covariance matrix).[8] The idea of a family
of “chi-squared distributions”, however, is not due to Pearson
but arose as a further development due to Fisher in the
1920s.[6]


28.2 Definition

If $Z_1, \ldots, Z_k$ are independent, standard normal random
variables, then the sum of their squares,

$$Q = \sum_{i=1}^{k} Z_i^2,$$

is distributed according to the chi-squared distribution
with k degrees of freedom. This is usually denoted as

$$Q \sim \chi^2(k) \quad \text{or} \quad Q \sim \chi^2_k.$$

The chi-squared distribution has one parameter: k, a
positive integer that specifies the number of degrees of
freedom (i.e. the number of $Z_i$s).


28.3 Characteristics

28.3.1 Probability density function

The probability density function (pdf) of the chi-squared
distribution is

$$f(x; k) = \begin{cases} \dfrac{x^{k/2 - 1} e^{-x/2}}{2^{k/2}\, \Gamma\!\left(\frac{k}{2}\right)}, & x > 0; \\ 0, & \text{otherwise,} \end{cases}$$

where $\Gamma(k/2)$ denotes the Gamma function, which has
closed-form values for integer k.
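
The definition and the density can be compared numerically. The following is a minimal sketch using only the Python standard library: it draws chi-squared values as sums of squared standard normals and compares the share of draws below a threshold with the integral of the density. The choice k = 4, the threshold 3, the seed and the function names are assumptions made for this illustration.

```python
import math
import random


def chi2_pdf(x, k):
    """Density of the chi-squared distribution with k degrees of freedom."""
    if x <= 0:
        return 0.0
    half_k = k / 2
    log_density = ((half_k - 1) * math.log(x) - x / 2
                   - half_k * math.log(2) - math.lgamma(half_k))
    return math.exp(log_density)


def sample_chi2(k, rng):
    """Draw one value as the sum of squares of k standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))


rng = random.Random(0)  # fixed seed so the run is repeatable
k = 4
draws = [sample_chi2(k, rng) for _ in range(20_000)]
share_below_3 = sum(d <= 3.0 for d in draws) / len(draws)

# Midpoint-rule integral of the density over (0, 3].
steps = 3_000
width = 3.0 / steps
area_below_3 = sum(chi2_pdf((i + 0.5) * width, k) for i in range(steps)) * width
print(share_below_3, area_below_3)
```

The two printed numbers should be close to each other; the simulated share fluctuates with the seed, while the integral does not.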
For derivations of the pdf in the cases of one, two and
k degrees of freedom, see Proofs related to chi-squared
distribution.


28.3.2 Differential equation

The pdf of the chi-squared distribution is a solution to the
following differential equation:

$$2x f'(x) + f(x)(-k + x + 2) = 0, \qquad f(1) = \frac{2^{-k/2}}{\sqrt{e}\, \Gamma\!\left(\frac{k}{2}\right)}.$$


28.3.3 Cumulative distribution function

[Figure: Chernoff bound for the CDF and tail (1-CDF) of a
chi-squared random variable with ten degrees of freedom (k = 10)]

Its cumulative distribution function is:

$$F(x; k) = \frac{\gamma\!\left(\frac{k}{2}, \frac{x}{2}\right)}{\Gamma\!\left(\frac{k}{2}\right)} = P\!\left(\frac{k}{2}, \frac{x}{2}\right),$$

where $\gamma(s, t)$ is the lower incomplete Gamma function and
$P(s, t)$ is the regularized Gamma function.

In a special case of k = 2 this function has a simple form:

$$F(x; 2) = 1 - e^{-x/2},$$

and the form is not much more complicated for other
small even k.

Tables of the chi-squared cumulative distribution function
are widely available and the function is included in
many spreadsheets and all statistical packages.

Letting $z = x/k$, Chernoff bounds on the lower and
upper tails of the CDF may be obtained.[9] For the cases
when $0 < z < 1$ (which include all of the cases when
this CDF is less than half):

$$F(zk; k) \le (z e^{1-z})^{k/2}.$$

The tail bound for the cases when $z > 1$, similarly, is

$$1 - F(zk; k) \le (z e^{1-z})^{k/2}.$$

For another approximation for the CDF modeled after
the cube of a Gaussian, see under Noncentral chi-squared
distribution.


28.3.4 Additivity

It follows from the definition of the chi-squared distribution
that the sum of independent chi-squared variables
is also chi-squared distributed. Specifically, if $\{X_i\}_{i=1}^{n}$
are independent chi-squared variables with $\{k_i\}_{i=1}^{n}$
degrees of freedom, respectively, then $Y = X_1 + \cdots + X_n$
is chi-squared distributed with $k_1 + \cdots + k_n$ degrees of
freedom.


28.3.5 Sample mean

The sample mean of n i.i.d. chi-squared variables of
degree k is distributed according to a gamma distribution
with shape $\alpha$ and scale $\theta$ parameters:

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \sim \mathrm{Gamma}(\alpha = nk/2,\ \theta = 2/n), \quad \text{where } X_i \sim \chi^2(k).$$

Asymptotically, given that for a shape parameter $\alpha$ going
to infinity, a Gamma distribution converges towards a
Normal distribution with expectation $\mu = \alpha \cdot \theta$ and
variance $\sigma^2 = \alpha\, \theta^2$, the sample mean converges towards:

$$\bar{X} \xrightarrow{\ n \to \infty\ } N(\mu = k,\ \sigma^2 = 2k/n).$$

Note that we would have obtained the same result invoking
instead the central limit theorem, noting that for each
chi-squared variable of degree k the expectation is k,
and its variance 2k (and hence the variance of the sample
mean $\bar{X}$ being $\sigma^2 = 2k/n$).


28.3.6 Entropy

The differential entropy is given by

$$h = -\int_{0}^{\infty} f(x; k) \ln f(x; k)\, dx = \frac{k}{2} + \ln\!\left[2\, \Gamma\!\left(\frac{k}{2}\right)\right] + \left(1 - \frac{k}{2}\right) \psi\!\left(\frac{k}{2}\right),$$

where $\psi(x)$ is the Digamma function.

The chi-squared distribution is the maximum entropy
probability distribution for a random variate X for which
$E(X) = k$ and $E(\ln(X)) = \psi(k/2) + \log(2)$ are
fixed. Since the chi-squared is in the family of gamma
distributions, this can be derived by substituting appropriate
values in the Expectation of the Log moment of
Gamma. For derivation from more basic principles, see
the derivation in moment generating function of the
sufficient statistic.
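
The cumulative distribution function of section 28.3.3 and its Chernoff bounds can be checked numerically. The sketch below evaluates the regularized Gamma function P(k/2, x/2) through the standard power series of the lower incomplete Gamma function; this series is one possible way to compute it and is chosen here only because it needs nothing beyond the standard library. Function names and the evaluation points are assumptions made for this illustration.

```python
import math


def chi2_cdf(x, k):
    """F(x; k) = P(k/2, x/2), via the power series of the lower incomplete gamma."""
    if x <= 0:
        return 0.0
    s, t = k / 2, x / 2
    term = 1.0 / s
    total = term
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= t / (s + n)
        total += term
    return total * math.exp(s * math.log(t) - t - math.lgamma(s))


def chernoff_bound(z, k):
    """(z e^(1 - z))^(k/2): bounds F(zk; k) for 0 < z < 1 and 1 - F(zk; k) for z > 1."""
    return (z * math.exp(1 - z)) ** (k / 2)


# Special case k = 2: the CDF should equal 1 - exp(-x / 2).
print(chi2_cdf(3.0, 2), 1 - math.exp(-1.5))

k = 10
for z in (0.2, 0.5, 0.8):   # lower tail
    print(z, chi2_cdf(z * k, k), chernoff_bound(z, k))
for z in (1.5, 2.0, 3.0):   # upper tail
    print(z, 1 - chi2_cdf(z * k, k), chernoff_bound(z, k))
```

In every printed row for k = 10, the exact probability in the middle column should not exceed the bound in the last column.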


28.3.7 Noncentral moments 

The moments about zero of a chi-squared distribution
with k degrees of freedom are given by[10][11]

$$E(X^m) = k (k+2)(k+4) \cdots (k+2m-2) = 2^m\, \frac{\Gamma\!\left(m + \frac{k}{2}\right)}{\Gamma\!\left(\frac{k}{2}\right)}.$$

[Figure: Median of Chi-Square Distribution. Approximate
formula for median compared with numerical quantile (top).
Difference between numerical quantile and approximate
formula (bottom). Horizontal axis: degrees of freedom.]


28.3.8 Cumulants 

The cumulants are readily obtained by a (formal) power 
series expansion of the logarithm of the characteristic 
function: 


$$\kappa_n = 2^{n-1} (n-1)!\, k$$
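
The moment and cumulant formulas can be cross-checked against each other. The sketch below evaluates the raw moments both as a product and through the Gamma function, then converts the first three raw moments into cumulants with the usual moment-to-cumulant relations. The value k = 5 and the function names are assumptions made for this illustration.

```python
import math


def raw_moment(m, k):
    """E(X^m) for X ~ chi-squared(k), as the product k (k + 2) ... (k + 2m - 2)."""
    product = 1
    for j in range(m):
        product *= k + 2 * j
    return product


def raw_moment_gamma(m, k):
    """The same moment written as 2^m * Gamma(m + k/2) / Gamma(k/2)."""
    return 2 ** m * math.exp(math.lgamma(m + k / 2) - math.lgamma(k / 2))


def cumulant(n, k):
    """n-th cumulant, 2^(n - 1) * (n - 1)! * k."""
    return 2 ** (n - 1) * math.factorial(n - 1) * k


k = 5
m1, m2, m3 = (raw_moment(m, k) for m in (1, 2, 3))
# Standard conversions from raw moments to the first three cumulants.
kappa_from_moments = (m1, m2 - m1 ** 2, m3 - 3 * m2 * m1 + 2 * m1 ** 3)
kappa_from_formula = tuple(cumulant(n, k) for n in (1, 2, 3))
print(kappa_from_moments, kappa_from_formula)
```

The first two cumulants are the mean k and the variance 2k, which is a quick sanity check on both formulas.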


28.3.9 Asymptotic properties 

By the central limit theorem, because the chi-squared
distribution is the sum of k independent random variables
with finite mean and variance, it converges to a normal
distribution for large k. For many practical purposes, for
k > 50 the distribution is sufficiently close to a normal
distribution for the difference to be ignored.[12] Specifically,
if $X \sim \chi^2(k)$, then as k tends to infinity, the distribution of
$(X - k)/\sqrt{2k}$ tends to a standard normal distribution.
However, convergence is slow as the skewness is $\sqrt{8/k}$
and the excess kurtosis is $12/k$.

The sampling distribution of $\ln(\chi^2)$ converges to
normality much faster than the sampling distribution
of $\chi^2$,[13] as the logarithm removes much of the
asymmetry.[14] Other functions of the chi-squared
distribution converge more rapidly to a normal distribution.
Some examples are:

• If $X \sim \chi^2(k)$ then $\sqrt{2X}$ is approximately normally
distributed with mean $\sqrt{2k - 1}$ and unit variance (result
credited to R. A. Fisher).

• If $X \sim \chi^2(k)$ then $\sqrt[3]{X/k}$ is approximately normally
distributed with mean $1 - 2/(9k)$ and variance $2/(9k)$.[15]
This is known as the Wilson-Hilferty transformation.
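
To see how much faster the cube-root transformation converges than the plain normal approximation, the following sketch compares both against the exact CDF for k = 10. The exact CDF is computed with a power series for the regularized Gamma function; that choice, the evaluation points and the function names are assumptions made for this illustration.

```python
import math


def chi2_cdf(x, k):
    """Exact F(x; k) via the power series of the lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    s, t = k / 2, x / 2
    term = 1.0 / s
    total = term
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= t / (s + n)
        total += term
    return total * math.exp(s * math.log(t) - t - math.lgamma(s))


def normal_cdf(u):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(u / math.sqrt(2)))


def wilson_hilferty_cdf(x, k):
    """Approximate F(x; k) by treating (X/k)^(1/3) as N(1 - 2/(9k), 2/(9k))."""
    mean = 1 - 2 / (9 * k)
    sd = math.sqrt(2 / (9 * k))
    return normal_cdf(((x / k) ** (1 / 3) - mean) / sd)


def plain_normal_cdf(x, k):
    """Approximate F(x; k) by treating (X - k)/sqrt(2k) as standard normal."""
    return normal_cdf((x - k) / math.sqrt(2 * k))


k = 10
for x in (4.0, 10.0, 18.0):
    print(x, chi2_cdf(x, k), wilson_hilferty_cdf(x, k), plain_normal_cdf(x, k))
```

For these points the Wilson-Hilferty column should track the exact column much more closely than the plain normal column does.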


28.4 Relation to other distributions 

• As $k \to \infty$, $(\chi^2_k - k)/\sqrt{2k} \xrightarrow{d} N(0, 1)$ (normal
distribution)

• $\chi^2_k \sim \chi'^2_k(0)$ (Noncentral chi-squared distribution
with non-centrality parameter $\lambda = 0$)

• If $X \sim F(\nu_1, \nu_2)$ then $Y = \lim_{\nu_2 \to \infty} \nu_1 X$ has the
chi-squared distribution $\chi^2_{\nu_1}$

• As a special case, if $X \sim F(1, \nu_2)$ then
$Y = \lim_{\nu_2 \to \infty} X$ has the chi-squared distribution $\chi^2_1$

• $\| N_{i=1,\ldots,k}(0, 1) \|^2 \sim \chi^2_k$ (The squared norm of
k standard normally distributed variables is a chi-
squared distribution with k degrees of freedom)

• If $X \sim \chi^2(\nu)$ and $c > 0$, then $cX \sim \Gamma(k = \nu/2,\ \theta = 2c)$.
(gamma distribution)

• If $X \sim \chi^2_k$ then $\sqrt{X} \sim \chi_k$ (chi distribution)

• If $X \sim \chi^2(2)$, then $X \sim \mathrm{Exp}(1/2)$ is an
exponential distribution. (See Gamma distribution
for more.)

• If $X \sim \mathrm{Rayleigh}(1)$ (Rayleigh distribution) then
$X^2 \sim \chi^2(2)$

• If $X \sim \mathrm{Maxwell}(1)$ (Maxwell distribution) then
$X^2 \sim \chi^2(3)$

• If $X \sim \chi^2(\nu)$ then $1/X \sim \text{Inv-}\chi^2(\nu)$ (Inverse-chi-
squared distribution)

• The chi-squared distribution is a special case of type
3 Pearson distribution

• If $X \sim \chi^2(\nu_1)$ and $Y \sim \chi^2(\nu_2)$ are independent
then $\frac{X}{X+Y} \sim \mathrm{Beta}\!\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right)$ (beta distribution)

• If $X \sim U(0, 1)$ (uniform distribution) then
$-2 \log(X) \sim \chi^2(2)$

• $\chi^2(6)$ is a transformation of Laplace distribution

• If $X_i \sim \mathrm{Laplace}(\mu, \beta)$ then
$\sum_{i=1}^{n} \frac{2 |X_i - \mu|}{\beta} \sim \chi^2(2n)$

• chi-squared distribution is a transformation of
Pareto distribution

• Student’s t-distribution is a transformation of chi-
squared distribution

• Student’s t-distribution can be obtained from chi-
squared distribution and normal distribution

• Noncentral beta distribution can be obtained as
a transformation of chi-squared distribution and
Noncentral chi-squared distribution

• Noncentral t-distribution can be obtained from normal
distribution and chi-squared distribution

A chi-squared variable with k degrees of freedom is defined
as the sum of the squares of k independent standard
normal random variables.

If Y is a k-dimensional Gaussian random vector with
mean vector $\mu$ and rank k covariance matrix C, then
$X = (Y - \mu)^T C^{-1} (Y - \mu)$ is chi-squared distributed with k
degrees of freedom.

The sum of squares of statistically independent unit-
variance Gaussian variables which do not have mean zero
yields a generalization of the chi-squared distribution
called the noncentral chi-squared distribution.

If Y is a vector of k i.i.d. standard normal random variables
and A is a $k \times k$ symmetric, idempotent matrix with
rank $k - n$ then the quadratic form $Y^T A Y$ is chi-squared
distributed with $k - n$ degrees of freedom.

The chi-squared distribution is also naturally related to
other distributions arising from the Gaussian. In particular,

• Y is F-distributed, $Y \sim F(k_1, k_2)$, if $Y = \frac{X_1 / k_1}{X_2 / k_2}$ where
$X_1 \sim \chi^2(k_1)$ and $X_2 \sim \chi^2(k_2)$ are statistically
independent.

• If X is chi-squared distributed, then $\sqrt{X}$ is chi
distributed.

• If $X_1 \sim \chi^2_{k_1}$ and $X_2 \sim \chi^2_{k_2}$ are statistically
independent, then $X_1 + X_2 \sim \chi^2_{k_1 + k_2}$. If $X_1$ and $X_2$ are
not independent, then $X_1 + X_2$ is not, in general,
chi-squared distributed.
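
Two of the relations listed above can be checked by simulation. The following is a minimal Monte Carlo sketch using only the standard library: it compares −2 log(U) for uniform U against the chi-squared distribution with 2 degrees of freedom, and X/(X + Y) for independent chi-squared variables against the mean of the corresponding beta distribution. The seed, the sample size and the degrees of freedom (3 and 5) are assumptions made for this illustration.

```python
import math
import random

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 50_000


def chi2_draw(k):
    """One chi-squared(k) draw as a sum of k squared standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))


# -2 log(U) with U uniform on (0, 1] should behave like chi-squared(2),
# whose mean is 2 and whose CDF is 1 - exp(-x / 2).
from_uniform = [-2.0 * math.log(1.0 - rng.random()) for _ in range(n)]
mean_from_uniform = sum(from_uniform) / n
share_below_3 = sum(v <= 3.0 for v in from_uniform) / n

# X / (X + Y) for independent chi-squared(3) and chi-squared(5) draws
# should behave like Beta(3/2, 5/2), whose mean is 1.5 / 4 = 0.375.
nu1, nu2 = 3, 5
ratios = []
for _ in range(n):
    x, y = chi2_draw(nu1), chi2_draw(nu2)
    ratios.append(x / (x + y))
mean_ratio = sum(ratios) / n

print(mean_from_uniform, share_below_3, 1 - math.exp(-1.5), mean_ratio)
```

Agreement here is only statistical: the simulated values fluctuate around the theoretical ones by an amount that shrinks as n grows.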


28.5 Generalizations 

The chi-squared distribution is obtained as the sum of
the squares of k independent, zero-mean, unit-variance
Gaussian random variables. Generalizations of this
distribution can be obtained by summing the squares of other
types of Gaussian random variables. Several such
distributions are described below.


28.5.1 Linear combination

If $X_1, \ldots, X_n$ are chi square random variables and
$a_1, \ldots, a_n \in \mathbb{R}_{>0}$, then a closed expression for the
distribution of $X = \sum_{i=1}^{n} a_i X_i$ is not known. It may be,
however, calculated using the property of characteristic
functions of the chi-squared random variable.[16]


28.5.2 Chi-squared distributions 

Noncentral chi-squared distribution 

Main article: Noncentral chi-squared distribution 

The noncentral chi-squared distribution is obtained from 
the sum of the squares of independent Gaussian random 
variables having unit variance and nonzero means. 


Generalized chi-squared distribution 

Main article: Generalized chi-squared distribution 

The generalized chi-squared distribution is obtained from 
the quadratic form z'Az where z is a zero-mean Gaussian 
vector having an arbitrary covariance matrix, and A is an 
arbitrary matrix. 


28.5.3 Gamma, exponential, and related 
distributions 

The chi-squared distribution $X \sim \chi^2(k)$ is a special case of
the gamma distribution, in that $X \sim \Gamma(k/2, 1/2)$ using the
rate parameterization of the gamma distribution (or
$X \sim \Gamma(k/2, 2)$ using the scale parameterization of the gamma
distribution) where k is an integer.

Because the exponential distribution is also a special case
of the Gamma distribution, we also have that if $X \sim \chi^2(2)$,
then $X \sim \mathrm{Exp}(1/2)$ is an exponential distribution.

The Erlang distribution is also a special case of the
Gamma distribution and thus we also have that if
$X \sim \chi^2(k)$ with even k, then X is Erlang distributed with shape
parameter $k/2$ and rate parameter 1/2 (that is, scale
parameter 2).
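
These identities can be confirmed pointwise by evaluating the densities. The sketch below writes the gamma, exponential and Erlang densities in their usual textbook forms (shape with scale, rate, and integer shape with rate respectively) and compares them with the chi-squared density; the function names and evaluation points are assumptions made for this illustration.

```python
import math


def chi2_pdf(x, k):
    """Chi-squared density for x > 0."""
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))


def gamma_pdf(x, shape, scale):
    """Gamma density in the shape/scale parameterization."""
    return x ** (shape - 1) * math.exp(-x / scale) / (scale ** shape * math.gamma(shape))


def exponential_pdf(x, rate):
    """Exponential density with the given rate."""
    return rate * math.exp(-rate * x)


def erlang_pdf(x, shape, rate):
    """Erlang density: integer shape, rate parameterization."""
    return rate ** shape * x ** (shape - 1) * math.exp(-rate * x) / math.factorial(shape - 1)


for x in (0.5, 2.0, 7.5):
    print(x,
          chi2_pdf(x, 5), gamma_pdf(x, 5 / 2, 2),      # chi-squared(5) vs Gamma(5/2, scale 2)
          chi2_pdf(x, 2), exponential_pdf(x, 1 / 2),   # chi-squared(2) vs Exp(1/2)
          chi2_pdf(x, 6), erlang_pdf(x, 3, 1 / 2))     # chi-squared(6) vs Erlang(3, rate 1/2)
```

Each pair of adjacent columns should agree to floating-point precision.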
28.6 Applications

The chi-squared distribution has numerous applications
in inferential statistics, for instance in chi-squared tests
and in estimating variances. It enters the problem of
estimating the mean of a normally distributed population and
the problem of estimating the slope of a regression line
via its role in Student’s t-distribution. It enters all analysis
of variance problems via its role in the F-distribution,
which is the distribution of the ratio of two independent
chi-squared random variables, each divided by their
respective degrees of freedom.

Following are some of the most common situations 
in which the chi-squared distribution arises from a 
Gaussian-distributed sample. 

• If $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ random variables,
then $\sum_{i=1}^{n} (X_i - \bar{X})^2 \sim \sigma^2 \chi^2_{n-1}$ where
$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$.

• The box below shows some statistics based on
$X_i \sim \mathrm{Normal}(\mu_i, \sigma_i^2)$, $i = 1, \ldots, k$, independent random
variables that have probability distributions related
to the chi-squared distribution:


The chi-squared distribution is also often encountered in
magnetic resonance imaging.[17]
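
The first situation in this section, the sum of squared deviations from the sample mean of a normal sample, can be illustrated by simulation. In the sketch below the population mean, standard deviation, sample size and seed are hypothetical values chosen for this example; the scaled sum of squares should then have mean n − 1 and variance 2(n − 1), as a chi-squared variable with n − 1 degrees of freedom does.

```python
import random
import statistics

rng = random.Random(2)           # fixed seed so the run is repeatable
mu, sigma, n = 10.0, 2.0, 5      # hypothetical population and sample size
replications = 20_000


def scaled_sum_of_squares(sample, sigma):
    """Sum of squared deviations from the sample mean, divided by sigma^2."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / sigma ** 2


values = [
    scaled_sum_of_squares([rng.gauss(mu, sigma) for _ in range(n)], sigma)
    for _ in range(replications)
]
observed_mean = statistics.fmean(values)
observed_variance = statistics.pvariance(values)
print(observed_mean, n - 1)            # mean of chi-squared(n - 1) is n - 1
print(observed_variance, 2 * (n - 1))  # variance of chi-squared(n - 1) is 2(n - 1)
```

Note that the degrees of freedom are n − 1 rather than n because the deviations are taken from the sample mean, not the true mean.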

28.7 Table of χ² value vs p-value

The p-value is the probability of observing a test statistic
at least as extreme in a chi-squared distribution. Accordingly,
since the cumulative distribution function (CDF)
for the appropriate degrees of freedom (df) gives the
probability of having obtained a value less extreme than
this point, subtracting the CDF value from 1 gives the
p-value. The table below gives a number of p-values matching
to χ² for the first 10 degrees of freedom.

A low p-value indicates greater statistical significance, i.e. 
greater confidence that the observed deviation from the 
null hypothesis is significant. A p-value of 0.05 is often 
used as a bright-line cutoff between significant and not- 
significant results. 
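
The relationship "p-value = 1 − CDF" described above can be sketched as follows. The CDF is computed here with a power series for the regularized Gamma function, which is one possible implementation chosen to avoid external libraries; the function names are assumptions, and the χ² values used are the familiar 5% critical values for 1, 2 and 10 degrees of freedom.

```python
import math


def chi2_cdf(x, df):
    """Chi-squared CDF via the power series of the lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    s, t = df / 2, x / 2
    term = 1.0 / s
    total = term
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= t / (s + n)
        total += term
    return total * math.exp(s * math.log(t) - t - math.lgamma(s))


def p_value(chi2_value, df):
    """Probability of a statistic at least as large as chi2_value."""
    return 1.0 - chi2_cdf(chi2_value, df)


# Each of these pairs should give a p-value close to 0.05.
for chi2_value, df in ((3.841, 1), (5.991, 2), (18.307, 10)):
    print(df, chi2_value, round(p_value(chi2_value, df), 4))
```

A statistic larger than the listed value for its degrees of freedom would therefore fall below the 0.05 cutoff mentioned in the text.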

28.8 See also 

• Cochran’s theorem 

• F-distribution 

• Fisher’s method for combining independent tests of 
significance 


• Gamma distribution 

• Generalized chi-squared distribution 

• Noncentral chi-squared distribution 

• Hotelling’s T-squared distribution 

• Pearson’s chi-squared test 

• Student’s t-distribution 

• Wilks’ lambda distribution 

• Wishart distribution 


28.9 References 

[1] M. A. Sanders. “Characteristic function of the central chi-
squared distribution” (PDF). Retrieved 2009-03-06.

[2] Abramowitz, Milton; Stegun, Irene A., eds. (1965),
“Chapter 26”, Handbook of Mathematical Functions with
Formulas, Graphs, and Mathematical Tables, New York:
Dover, p. 940, ISBN 978-0486612720, MR 0167642.

[3] NIST (2006). Engineering Statistics Handbook - Chi-
Squared Distribution

[4] Johnson, N. L.; Kotz, S.; Balakrishnan, N. (1994). “Chi-
Squared Distributions including Chi and Rayleigh”.
Continuous Univariate Distributions 1 (Second ed.). John
Wiley and Sons. pp. 415-493. ISBN 0-471-58495-9.

[5] Mood, Alexander; Graybill, Franklin A.; Boes, Duane C.
(1974). Introduction to the Theory of Statistics (Third ed.).
McGraw-Hill. pp. 241-246. ISBN 0-07-042864-6.

[6] Hald 1998, pp. 633-692, 27. Sampling Distributions
under Normality.

[7] F. R. Helmert, "Ueber die Wahrscheinlichkeit der
Potenzsummen der Beobachtungsfehler und über einige damit im
Zusammenhange stehende Fragen", Zeitschrift für
Mathematik und Physik 21, 1876, S. 102-219

[8] R. L. Plackett, Karl Pearson and the Chi-Squared Test,
International Statistical Review, 1983, 61f. See also Jeff
Miller, Earliest Known Uses of Some of the Words of
Mathematics.

[9] Dasgupta, Sanjoy D. A.; Gupta, Anupam K. (2002). “An
Elementary Proof of a Theorem of Johnson and
Lindenstrauss” (PDF). Random Structures and Algorithms 22:
60-65. doi:10.1002/rsa.10073. Retrieved 2012-05-01.

[10] Chi-squared distribution, from MathWorld, retrieved Feb.
11, 2009

[11] M. K. Simon, Probability Distributions Involving
Gaussian Random Variables, New York: Springer, 2002. eq.
(2.35), ISBN 978-0-387-34657-1

[12] Box, Hunter and Hunter (1978). Statistics for
experimenters. Wiley. p. 118. ISBN 0471093157.




[13] Bartlett, M. S.; Kendall, D. G. (1946). “The Statistical 
Analysis of Variance-Heterogeneity and the Logarithmic 
Transformation”. Supplement to the Journal of the Royal 
Statistical Society 8 (1): 128-138. JSTOR 2983618. 

[14] Shoemaker, Lewis H. (2003). “Fixing the F Test for Equal 
Variances”. The American Statistician 57 (2): 105-114. 
doi: 10.1198/0003130031441. JSTOR 30037243. 

[15] Wilson, E. B.; Hilferty, M. M. (1931). “The distribution 
of chi-squared” (PDF). Proc. Natl. Acad. Sci. USA 17 
(12): 684-688. 

[16] Davies, R.B. (1980). “Algorithm AS155: The Distributions
of a Linear Combination of χ² Random Variables”.
Journal of the Royal Statistical Society 29 (3): 323-333.
doi:10.2307/2346911.

[17] den Dekker A. J., Sijbers J., (2014) “Data distributions in 
magnetic resonance images: a review”, Physica Medica, 

[18] Chi-Squared Test Table B.2. Dr. Jacqueline S. McLaughlin
at The Pennsylvania State University. In turn citing:
R. A. Fisher and F. Yates, Statistical Tables for Biological
Agricultural and Medical Research, 6th ed., Table IV


28.10 Further reading 

• Hald, Anders (1998). A history of mathematical 
statistics from 1750 to 1930. New York: Wiley. 
ISBN 0-471-17912-4. 

• Elderton, William Palin (1902). “Tables for
Testing the Goodness of Fit of Theory to
Observation”. Biometrika 1 (2): 155-163.
doi:10.1093/biomet/1.2.155.


28.11 External links 

• Hazewinkel, Michiel, ed. (2001), “Chi-squared
distribution”, Encyclopedia of Mathematics, Springer,
ISBN 978-1-55608-010-4

• Calculator for the pdf, cdf and quantiles of the chi- 
squared distribution 

• Earliest Uses of Some of the Words of Mathematics: 
Chapter 29

Chi-squared test 



[Figure: Chi-square distribution, showing χ² (Pearson’s
cumulative test statistic) on the x-axis and p-value on
the y-axis.]


A chi-squared test, also referred to as χ² test (or chi-
square test), is any statistical hypothesis test in which the
sampling distribution of the test statistic is a chi-square
distribution when the null hypothesis is true. Chi-squared
tests are often constructed from a sum of squared errors,
or through the sample variance. Test statistics that follow
a chi-squared distribution arise from an assumption
of independent normally distributed data, which is valid
in many cases due to the central limit theorem. A chi-
squared test can then be used to reject the hypothesis that
the data are independent.

Also considered a chi-square test is a test in which this
is asymptotically true, meaning that the sampling
distribution (if the null hypothesis is true) can be made to
approximate a chi-square distribution as closely as desired
by making the sample size large enough. The chi-squared
test is used to determine whether there is a significant
difference between the expected frequencies and the
observed frequencies in one or more categories. Does the
number of individuals or objects that fall in each category
differ significantly from the number you would expect?
Is this difference between the expected and observed due
to sampling variation, or is it a real difference?


29.1 Examples of chi-square tests 
with samples 

One test statistic that follows a chi-square distribution
exactly is the test that the variance of a normally distributed
population has a given value based on a sample variance.
Such tests are uncommon in practice because the true
variance of the population is usually unknown. However,
there are several statistical tests where the chi-square
distribution is approximately valid:


29.1.1 Pearson’s chi-square test 

Main article: Pearson’s chi-square test 

Pearson’s chi-square test is also known as the chi-square
goodness-of-fit test or chi-square test for independence.
When the chi-square test is mentioned without any modifiers
or without other precluding context, this test is often
meant (for an exact test used in place of χ², see Fisher’s
exact test).


29.1.2 Yates’s correction for continuity 

Main article: Yates’s correction for continuity 

Using the chi-square distribution to interpret Pearson’s
chi-square statistic requires one to assume that the
discrete probability of observed binomial frequencies in
the table can be approximated by the continuous chi-
square distribution. This assumption is not quite correct,
and introduces some error.

To reduce the error in approximation, Frank Yates
suggested a correction for continuity that adjusts the formula
for Pearson’s chi-square test by subtracting 0.5 from the
difference between each observed value and its expected
value in a 2 × 2 contingency table.[1] This reduces the chi-
square value obtained and thus increases its p-value.
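
The effect of the correction can be seen on a small table. The sketch below computes Pearson's statistic for a 2 × 2 table with and without subtracting 0.5 from each absolute difference, as described above. The table values and function names are hypothetical and chosen only so that the arithmetic is easy to follow (every expected count is 15).

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_2x2(table, yates=False):
    """Pearson's statistic for a 2 x 2 table, optionally with Yates's correction."""
    statistic = 0.0
    for observed_row, expected_row in zip(table, expected_counts(table)):
        for observed, expected in zip(observed_row, expected_row):
            difference = abs(observed - expected)
            if yates:
                difference -= 0.5  # continuity correction described in the text
            statistic += difference ** 2 / expected
    return statistic


table = [[10, 20],
         [20, 10]]  # hypothetical counts
uncorrected = chi_square_2x2(table)
corrected = chi_square_2x2(table, yates=True)
print(uncorrected, corrected)
```

For this table the uncorrected value is 4 × 25/15 and the corrected value is 4 × 20.25/15, so the corrected statistic is smaller, in line with the statement that the correction increases the p-value.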
29.1.3 Other chi-square tests

• Cochran-Mantel-Haenszel chi-squared test.

• McNemar’s test, used in certain 2 × 2 tables with
pairing

• Tukey’s test of additivity

• The portmanteau test in time-series analysis, testing
for the presence of autocorrelation

• Likelihood-ratio tests in general statistical modelling,
for testing whether there is evidence of the
need to move from a simple model to a more complicated
one (where the simple model is nested within
the complicated one).


29.2 Chi-squared test for variance in a normal population

If a sample of size n is taken from a population having a
normal distribution, then there is a result (see distribution
of the sample variance) which allows a test to be made
of whether the variance of the population has a
predetermined value. For example, a manufacturing process
might have been in stable condition for a long period,
allowing a value for the variance to be determined
essentially without error. Suppose that a variant of the process
is being tested, giving rise to a small sample of n product
items whose variation is to be tested. The test statistic
T in this instance could be set to be the sum of squares
about the sample mean, divided by the nominal value for
the variance (i.e. the value to be tested as holding). Then
T has a chi-square distribution with n − 1 degrees of
freedom. For example if the sample size is 21, the acceptance
region for T for a significance level of 5% is the interval
9.59 to 34.17.

29.3 Example chi-squared test for categorical data

Suppose there is a city of 1 million residents with four
neighborhoods: A, B, C, and D. A random sample of
650 residents of the city is taken and their occupation
is recorded as “blue collar”, “white collar”, or “service”.
The null hypothesis is that each person’s neighborhood
of residence is independent of the person’s occupational
classification. The data are tabulated as:

[Table: observed counts by neighborhood and occupation]

Let us take the sample living in neighborhood A,
150/650, to estimate what proportion of the whole 1 million
people live in neighborhood A. Similarly we take
349/650 to estimate what proportion of the 1 million people
are blue-collar workers. By the assumption of independence
under the hypothesis we should “expect” the
number of blue-collar workers in neighborhood A to be

$$\frac{150}{650} \times \frac{349}{650} \times 650 \approx 80.54.$$

Then in that “cell” of the table, we have

$$\frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{(90 - 80.54)^2}{80.54}.$$

The sum of these quantities over all of the cells is the test
statistic. Under the null hypothesis, it has approximately a
chi-square distribution whose number of degrees of freedom
is

$$(\text{number of rows} - 1)(\text{number of columns} - 1) = (3 - 1)(4 - 1) = 6.$$

If the test statistic is improbably large according to that
chi-square distribution, then one rejects the null hypothesis
of independence.
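
The arithmetic for the single cell worked through above can be sketched in a few lines. Only the figures quoted in the text are used (a sample of 650, 150 residents of neighborhood A, an occupation total of 349 and an observed count of 90); the full table of counts is not reproduced here, so the complete test statistic is not computed. Function names are assumptions made for this illustration.

```python
def expected_count(row_total, column_total, sample_size):
    """Expected cell count under independence of rows and columns."""
    return row_total / sample_size * column_total / sample_size * sample_size


def cell_contribution(observed, expected):
    """One cell's contribution to the chi-square statistic."""
    return (observed - expected) ** 2 / expected


def degrees_of_freedom(rows, columns):
    """(number of rows - 1) * (number of columns - 1)."""
    return (rows - 1) * (columns - 1)


expected = expected_count(row_total=349, column_total=150, sample_size=650)
contribution = cell_contribution(observed=90, expected=expected)
dof = degrees_of_freedom(rows=3, columns=4)
print(round(expected, 2), round(contribution, 4), dof)
```

Summing `cell_contribution` over all twelve cells of the table would give the test statistic, which is then compared with the chi-square distribution with 6 degrees of freedom.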

A related issue is a test of homogeneity. Suppose that
instead of giving every resident of each of the four
neighborhoods an equal chance of inclusion in the sample, we
decide in advance how many residents of each neighborhood
to include. Then each resident has the same chance
of being chosen as do all residents of the same
neighborhood, but residents of different neighborhoods would
have different probabilities of being chosen if the four
sample sizes are not proportional to the populations of
the four neighborhoods. In such a case, we would be
testing “homogeneity” rather than “independence”. The
question is whether the proportions of blue-collar, white-
collar, and service workers in the four neighborhoods are
the same. However, the test is done in the same way.


29.4 Applications 

In cryptanalysis, the chi-square test is used to compare
the distribution of plaintext and (possibly) decrypted
ciphertext. The lowest value of the test means that the
decryption was successful with high probability.[2][3] This
method can be generalized for solving modern
cryptographic problems.[4]
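
As a small illustration of this idea, the following sketch tries every shift of a Caesar cipher and keeps the one whose letter counts give the lowest chi-square statistic against typical English letter frequencies. The Caesar cipher, the sample sentence, the function names and the rounded frequency table are all assumptions chosen for this example; they are not taken from the cited sources.

```python
import string

# Approximate relative frequencies of letters in English text, in percent
# (rounded values; an assumption made for this illustration).
ENGLISH_PERCENT = {
    "a": 8.2, "b": 1.5, "c": 2.8, "d": 4.3, "e": 12.7, "f": 2.2, "g": 2.0,
    "h": 6.1, "i": 7.0, "j": 0.15, "k": 0.8, "l": 4.0, "m": 2.4, "n": 6.7,
    "o": 7.5, "p": 1.9, "q": 0.1, "r": 6.0, "s": 6.3, "t": 9.1, "u": 2.8,
    "v": 1.0, "w": 2.4, "x": 0.15, "y": 2.0, "z": 0.07,
}


def shift_text(text, shift):
    """Shift lowercase letters by `shift` places; leave other characters alone."""
    shifted = []
    for ch in text:
        if ch in string.ascii_lowercase:
            shifted.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            shifted.append(ch)
    return "".join(shifted)


def chi_square_vs_english(text):
    """Chi-square statistic of the text's letter counts against English frequencies."""
    letters = [ch for ch in text if ch in string.ascii_lowercase]
    total_percent = sum(ENGLISH_PERCENT.values())
    statistic = 0.0
    for letter, percent in ENGLISH_PERCENT.items():
        expected = len(letters) * percent / total_percent
        observed = letters.count(letter)
        statistic += (observed - expected) ** 2 / expected
    return statistic


def best_caesar_shift(ciphertext):
    """Decryption shift whose output has the lowest chi-square statistic."""
    return min(range(26),
               key=lambda s: chi_square_vs_english(shift_text(ciphertext, -s)))


plaintext = ("it was the best of times it was the worst of times "
             "it was the age of wisdom it was the age of foolishness")
ciphertext = shift_text(plaintext, 3)
shift = best_caesar_shift(ciphertext)
print(shift, shift_text(ciphertext, -shift))
```

The approach relies on the text being long enough for its letter counts to resemble the reference frequencies; very short or atypical texts can defeat it.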


29.5 See also 

• Chi-square test nomogram 

• G-test 

• Minimum chi-square estimation 

• The Wald test can be evaluated against a chi-square 
distribution. 
Chapter 30

Goodness of fit 


The goodness of fit of a statistical model describes how
well it fits a set of observations. Measures of goodness
of fit typically summarize the discrepancy between
observed values and the values expected under the model
in question. Such measures can be used in statistical
hypothesis testing, e.g. to test for normality of residuals,
to test whether two samples are drawn from identical
distributions (see Kolmogorov-Smirnov test), or whether
outcome frequencies follow a specified distribution (see
Pearson’s chi-squared test). In the analysis of variance,
one of the components into which the variance is
partitioned may be a lack-of-fit sum of squares.

30.1 Fit of distributions 

In assessing whether a given distribution is suited to a 
data-set, the following tests and their underlying measures 
of fit can be used: 

• Kolmogorov-Smirnov test; 

• Cramer-von Mises criterion; 

• Anderson-Darling test; 

• Shapiro-Wilk test; 

• Chi Square test; 

• Akaike information criterion; 

• Hosmer-Lemeshow test; 

30.2 Regression analysis 

In regression analysis, the following topics relate to good¬ 
ness of fit: 

• Coefficient of determination (The R 
squared measure of goodness of fit); 

• Lack-of-fit sum of squares. 

30.2.1 Example 

One way in which a measure of goodness of fit statistic
can be constructed, in the case where the variance of the
measurement error is known, is to construct a weighted
sum of squared errors:

$$\chi^2 = \sum \frac{(O - E)^2}{\sigma^2}$$

where $\sigma^2$ is the known variance of the observation, O
is the observed data and E is the theoretical data.[1] This
definition is only useful when one has estimates for the
error on the measurements, but it leads to a situation where
a chi-squared distribution can be used to test goodness
of fit, provided that the errors can be assumed to have a
normal distribution.

The reduced chi-squared statistic is simply the chi-
squared divided by the number of degrees of
freedom:[1][2][3][4]

$$\chi^2_{\mathrm{red}} = \frac{\chi^2}{\nu} = \frac{1}{\nu} \sum \frac{(O - E)^2}{\sigma^2}$$

where $\nu$ is the number of degrees of freedom, usually
given by $N - n - 1$, where N is the number of observations,
and n is the number of fitted parameters, assuming
that the mean value is an additional fitted parameter. The
advantage of the reduced chi-squared is that it already
normalizes for the number of data points and model
complexity. This is also known as the mean square weighted
deviation.

As a rule of thumb (again valid only when the variance
of the measurement error is known a priori rather than
estimated from the data), a $\chi^2_{\mathrm{red}} \gg 1$ indicates a poor
model fit. A $\chi^2_{\mathrm{red}} > 1$ indicates that the fit has not fully
captured the data (or that the error variance has been
underestimated). In principle, a value of $\chi^2_{\mathrm{red}} = 1$ indicates
that the extent of the match between observations and
estimates is in accord with the error variance. A $\chi^2_{\mathrm{red}} < 1$
indicates that the model is 'over-fitting' the data: either
the model is improperly fitting noise, or the error
variance has been overestimated.[5]
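
The weighted sum of squared errors and its reduced form can be sketched as follows. The observations, model values, measurement error and number of fitted parameters are hypothetical values invented for this example, and the degrees of freedom follow the N − n − 1 convention stated above.

```python
def chi_squared(observed, expected, sigma):
    """Weighted sum of squared errors with a known measurement error sigma."""
    return sum(((o - e) / sigma) ** 2 for o, e in zip(observed, expected))


def reduced_chi_squared(observed, expected, sigma, n_fitted):
    """Chi-squared divided by the degrees of freedom N - n - 1."""
    degrees_of_freedom = len(observed) - n_fitted - 1
    return chi_squared(observed, expected, sigma) / degrees_of_freedom


# Hypothetical measurements and model predictions, with known error 0.1.
observed = [1.1, 1.9, 3.2, 3.9, 5.1]
expected = [1.0, 2.0, 3.0, 4.0, 5.0]
sigma = 0.1
n_fitted = 2  # e.g. a model with two fitted parameters

chi2_value = chi_squared(observed, expected, sigma)
chi2_red = reduced_chi_squared(observed, expected, sigma, n_fitted)
print(chi2_value, chi2_red)
```

Here the residuals are one or two error bars in size, giving a chi-squared of about 8 on 2 degrees of freedom; by the rule of thumb above, a reduced value of about 4 suggests the model has not fully captured the data or the error has been underestimated.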
30.3 Categorical data

The following are examples that arise in the context of 
categorical data. 

30.3.1 Pearson’s chi-squared test 

Pearson’s chi-squared test uses a measure of goodness of
fit which is the sum of differences between observed and
expected outcome frequencies (that is, counts of
observations), each squared and divided by the expectation:

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

where:

$O_i$ = an observed frequency (i.e. count) for bin i

$E_i$ = an expected (theoretical) frequency for bin i,
asserted by the null hypothesis.

The expected frequency is calculated by:

$$E_i = \big( F(Y_u) - F(Y_l) \big)\, N$$

where:

F = the cumulative distribution function for
the distribution being tested.

$Y_u$ = the upper limit for class i,

$Y_l$ = the lower limit for class i, and

N = the sample size

The resulting value can be compared to the chi-squared
distribution to determine the goodness of fit. In order to
determine the degrees of freedom of the chi-squared
distribution, one takes the total number of observed
frequencies and subtracts the number of estimated parameters.
The test statistic follows, approximately, a chi-square
distribution with (k − c) degrees of freedom where k is the
number of non-empty cells and c is the number of
estimated parameters (including location and scale
parameters and shape parameters) for the distribution plus one;
the additional one accounts for the constraint that the
expected frequencies sum to the sample size.

Example: equal frequencies of men and women 

For example, to test the hypothesis that a random sample
of 100 people has been drawn from a population in
which men and women are equal in frequency, the observed
number of men and women would be compared
to the theoretical frequencies of 50 men and 50 women.
If there were 44 men in the sample and 56 women, then

$$\chi^2 = \frac{(44 - 50)^2}{50} + \frac{(56 - 50)^2}{50} = 1.44$$

If the null hypothesis is true (i.e., men and women are
chosen with equal probability in the sample), the test
statistic will be drawn from a chi-squared distribution
with one degree of freedom. Though one might expect
two degrees of freedom (one each for the men and
women), we must take into account that the total number
of men and women is constrained (100), and thus there is
only one degree of freedom (2 - 1). Alternatively, if the
male count is known the female count is determined, and
vice versa.

Consultation of the chi-squared distribution for 1 degree
of freedom shows that the probability of observing this
difference (or a more extreme difference than this) if men
and women are equally numerous in the population is
approximately 0.23. This probability is higher than
conventional criteria for statistical significance (0.001 to
0.05), so normally we would not reject the null hypothesis
that the number of men in the population is the same as the
number of women (i.e. we would consider our sample within
the range of what we would expect for a 50/50 male/female
ratio).
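
The arithmetic in this example can be checked with a few
lines of Python. This is only an illustrative sketch: the
helper names pearson_chi2 and chi2_sf_1df are invented here,
and the tail-probability formula is valid only for one
degree of freedom, where it reduces to erfc(sqrt(x/2)).

```python
import math

def pearson_chi2(observed, expected):
    """Sum of (O - E)^2 / E over all bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

observed = [44, 56]   # men and women counted in the sample
expected = [50, 50]   # counts asserted by the null hypothesis
stat = pearson_chi2(observed, expected)
p_value = chi2_sf_1df(stat)
print(round(stat, 2), round(p_value, 2))  # 1.44 0.23
```

For more than one degree of freedom a general chi-squared
distribution function from a statistics library would be
needed in place of chi2_sf_1df.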

30.3.2 Binomial case 

A binomial experiment is a sequence of independent trials
in which the trials can result in one of two outcomes,
success or failure. There are n trials each with probability
of success, denoted by p. Provided that $n p_i \gg 1$ for
every i (where i = 1, 2, ..., k), then

$$\chi^2 = \sum_{i=1}^{k} \frac{(N_i - n p_i)^2}{n p_i} = \sum_{\text{all cells}} \frac{(O - E)^2}{E}$$

This has approximately a chi-squared distribution with k
- 1 df. The fact that df = k - 1 is a consequence of the
restriction $\sum N_i = n$. We know there are k observed
cell counts; however, once any k - 1 are known, the
remaining one is uniquely determined. In other words, there
are only k - 1 freely determined cell counts, thus
df = k - 1.


30.4 Other measures of fit 

The likelihood ratio test statistic is a measure of the
goodness of fit of a model, judged by whether an expanded
form of the model provides a substantially improved fit.

30.5 See also 

• Deviance (statistics) (related to GLM)

• Overfitting


Chapter 31

Likelihood-ratio test 


Not to be confused with the use of likelihood ratios in 
diagnostic testing. 

In statistics, a likelihood ratio test is a statistical test
used to compare the goodness of fit of two models, one of
which (the null model) is a special case of the other (the
alternative model). The test is based on the likelihood
ratio, which expresses how many times more likely the
data are under one model than the other. This likelihood
ratio, or equivalently its logarithm, can then be used to
compute a p-value, or compared to a critical value to
decide whether to reject the null model in favour of the
alternative model. When the logarithm of the likelihood
ratio is used, the statistic is known as a log-likelihood
ratio statistic, and the probability distribution of this
test statistic, assuming that the null model is true, can be
approximated using Wilks’s theorem.

In the case of distinguishing between two models, each of
which has no unknown parameters, use of the likelihood
ratio test can be justified by the Neyman-Pearson lemma,
which demonstrates that such a test has the highest power
among all competitors. [1]


31.1 Use

Each of the two competing models, the null model and the
alternative model, is separately fitted to the data and the
log-likelihood recorded. The test statistic (often denoted
by D) is twice the difference in these log-likelihoods:

$$D = -2 \ln\left(\frac{\text{likelihood for null model}}{\text{likelihood for alternative model}}\right)$$

$$\phantom{D} = -2 \ln(\text{likelihood for null model}) + 2 \ln(\text{likelihood for alternative model})$$

The model with more parameters will always fit at least as
well (have an equal or greater log-likelihood). Whether it
fits significantly better and should thus be preferred is
determined by deriving the probability or p-value of the
difference D. Where the null hypothesis represents a special
case of the alternative hypothesis, the probability
distribution of the test statistic is approximately a
chi-squared distribution with degrees of freedom equal to
df2 - df1. [2] Symbols df1 and df2 represent the number of
free parameters of models 1 and 2, the null model and the
alternative model, respectively.

Here is an example of use. If the null model has 1
parameter and a log-likelihood of -8024 and the alternative
model has 3 parameters and a log-likelihood of
-8012, then the probability of this difference is that of a
chi-squared value of +2·(8024 - 8012) = 24 with 3 - 1
= 2 degrees of freedom. Certain assumptions [3] must be
met for the statistic to follow a chi-squared distribution,
and often empirical p-values are computed.

The likelihood-ratio test requires nested models, i.e.
models in which the more complex one can be transformed
into the simpler model by imposing a set of constraints
on the parameters. If the models are not nested,
then a generalization of the likelihood-ratio test can
usually be used instead: the relative likelihood.

31.2 Simple-vs-simple hypotheses

Main article: Neyman-Pearson lemma

A statistical model is often a parametrized family of
probability density functions or probability mass functions
$f(x \mid \theta)$. A simple-vs-simple hypothesis test has
completely specified models under both the null and
alternative hypotheses, which for convenience are written
in terms of fixed values of a notional parameter $\theta$:

$$H_0 : \theta = \theta_0,$$

$$H_1 : \theta = \theta_1.$$

Under either hypothesis, the distribution of the
data is fully specified; there are no unknown parameters
to estimate. The likelihood ratio test is based on the
likelihood ratio, which is often denoted by $\Lambda$ (the
capital Greek letter lambda). The likelihood ratio is
defined as follows: [4][5]

$$\Lambda(x) = \frac{L(\theta_0 \mid x)}{L(\theta_1 \mid x)} = \frac{f(\cup_i x_i \mid \theta_0)}{f(\cup_i x_i \mid \theta_1)}$$

or

$$\Lambda(x) = \frac{L(\theta_0 \mid x)}{\sup\{L(\theta \mid x) : \theta \in \{\theta_0, \theta_1\}\}},$$

where $L(\theta \mid x)$ is the likelihood function, and
sup is the supremum function. Note that some references may
use the reciprocal as the definition. [6] In the form stated
here, the likelihood ratio is small if the alternative model
is better than the null model and the likelihood ratio test
provides the decision rule as follows:

If $\Lambda > c$, do not reject $H_0$;

If $\Lambda < c$, reject $H_0$;

Reject with probability q if $\Lambda = c$.

The values c, q are usually chosen to obtain a specified
significance level $\alpha$, through the relation
$q \cdot P(\Lambda = c \mid H_0) + P(\Lambda < c \mid H_0) = \alpha$.
The Neyman-Pearson lemma states that this likelihood ratio
test is the most powerful among all level $\alpha$ tests for
this problem. [1]
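
As a concrete illustration of the simple-vs-simple case,
the sketch below evaluates the likelihood ratio for a coin
with two fully specified head probabilities. The data, the
two parameter values and the threshold c = 0.25 are all
hypothetical choices made for this example, and the
function names are invented here. The randomised step at
exact equality is only reported, not simulated.

```python
def bernoulli_likelihood(theta, flips):
    """Likelihood L(theta | x) of independent 0/1 outcomes with success probability theta."""
    heads = sum(flips)
    tails = len(flips) - heads
    return theta ** heads * (1.0 - theta) ** tails

def likelihood_ratio(theta0, theta1, flips):
    """Lambda(x) = L(theta0 | x) / L(theta1 | x)."""
    return bernoulli_likelihood(theta0, flips) / bernoulli_likelihood(theta1, flips)

def decide(lam, c):
    """Deterministic part of the decision rule; a tie would need randomisation."""
    if lam > c:
        return "do not reject H0"
    if lam < c:
        return "reject H0"
    return "tie: reject with probability q"

flips = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]    # hypothetical data: 8 heads, 2 tails
lam = likelihood_ratio(0.5, 0.7, flips)   # H0: theta = 0.5, H1: theta = 0.7
print(lam, decide(lam, c=0.25))           # c = 0.25 is an arbitrary threshold
```

In practice c and q would be derived from the desired
significance level rather than fixed by hand.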

31.3 Definition (likelihood ratio 
test for composite hypotheses) 

A null hypothesis is often stated by saying the parameter
$\theta$ is in a specified subset $\Theta_0$ of the
parameter space $\Theta$.

$$H_0 : \theta \in \Theta_0$$

$$H_1 : \theta \in \Theta_0^{c}$$

The likelihood function is
$L(\theta \mid x) = f(x \mid \theta)$ (with
$f(x \mid \theta)$ being the pdf or pmf), which is a
function of the parameter $\theta$ with x held fixed at the
value that was actually observed, i.e., the data. The
likelihood ratio test statistic is

$$\Lambda(x) = \frac{\sup\{L(\theta \mid x) : \theta \in \Theta_0\}}{\sup\{L(\theta \mid x) : \theta \in \Theta\}}$$

Here, the sup notation refers to the supremum function.

A likelihood ratio test is any test with critical region
(or rejection region) of the form $\{x \mid \Lambda \le c\}$
where c is any number satisfying $0 \le c \le 1$. Many
common test statistics such as the Z-test, the F-test,
Pearson’s chi-squared test and the G-test are tests for
nested models and can be phrased as log-likelihood ratios or
approximations thereof.


31.3.1 Interpretation 

Being a function of the data x, the likelihood ratio is
therefore a statistic. The likelihood ratio test rejects the
null hypothesis if the value of this statistic is too small.
How small is too small depends on the significance level
of the test, i.e., on what probability of Type I error is
considered tolerable (“Type I” errors consist of the
rejection of a null hypothesis that is true).

The numerator corresponds to the maximum likelihood
of an observed outcome under the null hypothesis. The
denominator corresponds to the maximum likelihood of
an observed outcome varying parameters over the whole
parameter space. Because the null parameter set is a subset
of the whole space, the numerator of this ratio is never
greater than the denominator. The likelihood ratio hence
is between 0 and 1. Low values of the likelihood ratio mean
that the observed result was less likely to occur under the
null hypothesis as compared to the alternative. High values
of the statistic mean that the observed outcome was nearly
as likely to occur under the null hypothesis as the
alternative, and the null hypothesis cannot be rejected.

31.3.2 Distribution: Wilks’s theorem 

If the distribution of the likelihood ratio corresponding to
a particular null and alternative hypothesis can be
explicitly determined then it can directly be used to form
decision regions (to accept/reject the null hypothesis). In
most cases, however, the exact distribution of the
likelihood ratio corresponding to specific hypotheses is
very difficult to determine. A convenient result, attributed
to Samuel S. Wilks, says that as the sample size n
approaches $\infty$, the test statistic $-2 \log(\Lambda)$
for a nested model will be asymptotically
$\chi^2$-distributed with degrees of freedom equal to the
difference in dimensionality of $\Theta$ and $\Theta_0$. [3]
This means that for a great variety of hypotheses, a
practitioner can compute the likelihood ratio $\Lambda$ for
the data and compare $-2 \log(\Lambda)$ to the $\chi^2$
value corresponding to a desired statistical significance as
an approximate statistical test.
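
To make this recipe concrete, the sketch below reuses the
log-likelihoods from the example in the Use section (-8024
for the null model with 1 parameter, -8012 for the
alternative with 3 parameters). The function names are
invented for this illustration, and the tail formula
exp(-x/2) holds only for exactly two degrees of freedom; it
is used here to avoid depending on a statistics library.

```python
import math

def lrt_statistic(loglik_null, loglik_alt):
    """D = -2 * (ln L_null - ln L_alt)."""
    return -2.0 * (loglik_null - loglik_alt)

def chi2_sf_2df(x):
    """Upper-tail probability of a chi-squared variable with exactly 2 degrees of freedom."""
    return math.exp(-x / 2.0)

D = lrt_statistic(-8024.0, -8012.0)   # log-likelihoods from the Use example
df = 3 - 1                            # difference in number of free parameters
p_value = chi2_sf_2df(D)              # valid only because df == 2
print(D, df, p_value)
```

The resulting p-value is far below the usual significance
thresholds, so under the chi-squared approximation the null
model would be rejected in favour of the alternative.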


31.4 Examples 

31.4.1 Coin tossing 

As an example, in the case of Pearson’s test, we might try
to compare two coins to determine whether they have the
same probability of coming up heads. Our observation
can be put into a contingency table with rows corresponding
to the coin and columns corresponding to heads or
tails. The elements of the contingency table will be the
number of times the coin for that row came up heads or
tails. The contents of this table are our observation X.

Here $\Theta$ consists of the possible combinations of
values of the parameters $p_{1H}$, $p_{1T}$, $p_{2H}$, and
$p_{2T}$, which are the probability that coins 1 and 2 come
up heads or tails. In what follows, i = 1, 2 and j = H, T,
and $k_{ij}$ denotes the number of times coin i came up j.
The hypothesis space H is constrained by the usual
constraints on a probability distribution,
$0 \le p_{ij} \le 1$, and $p_{iH} + p_{iT} = 1$.
The space of the null hypothesis $H_0$ is the subspace where
$p_{1j} = p_{2j}$. Writing $n_{ij}$ for the best values for
$p_{ij}$ under the hypothesis H, the maximum likelihood
estimate is given by

$$n_{ij} = \frac{k_{ij}}{k_{iH} + k_{iT}}.$$

Similarly, the maximum likelihood estimates of $p_{ij}$
under the null hypothesis $H_0$ are given by

$$m_{ij} = \frac{k_{1j} + k_{2j}}{k_{1H} + k_{2H} + k_{1T} + k_{2T}},$$

which does not depend on the coin i.

The hypothesis and null hypothesis can be rewritten
slightly so that they satisfy the constraints for the
logarithm of the likelihood ratio to have the desired nice
distribution. Since the constraint causes the
two-dimensional H to be reduced to the one-dimensional
$H_0$, the asymptotic distribution for the test will be
$\chi^2(1)$, the $\chi^2$ distribution with one degree of
freedom.
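
The two sets of estimates can be turned into a test
statistic directly. The sketch below is a minimal
illustration with hypothetical toss counts: it evaluates
-2 log Λ as twice the difference between the maximised
log-likelihoods, which for counts k_ij works out to
2 Σ k_ij log(n_ij / m_ij), and compares it to the
chi-squared distribution with one degree of freedom. The
function names are invented here.

```python
import math

def two_coin_lrt(k):
    """k[i][j]: count for coin i (0 or 1) and outcome j (0 = heads, 1 = tails).

    Returns -2 log(Lambda) = 2 * sum_ij k_ij * log(n_ij / m_ij).
    """
    total = sum(sum(row) for row in k)
    stat = 0.0
    for i in range(2):
        row_total = k[i][0] + k[i][1]
        for j in range(2):
            if k[i][j] == 0:
                continue  # a zero count contributes nothing to the sum
            n_ij = k[i][j] / row_total           # MLE under the full hypothesis
            m_ij = (k[0][j] + k[1][j]) / total   # MLE under the null hypothesis
            stat += 2.0 * k[i][j] * math.log(n_ij / m_ij)
    return stat

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

counts = [[30, 20], [20, 30]]   # hypothetical tosses: rows are coins, columns H and T
stat = two_coin_lrt(counts)
print(stat, chi2_sf_1df(stat))
```

When both coins show identical proportions the statistic is
zero, since the constrained and unconstrained estimates
coincide.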

For the general contingency table, we can write the
log-likelihood ratio statistic as

$$-2 \log \Lambda = 2 \sum_{i,j} k_{ij} \log \frac{n_{ij}}{m_{ij}}.$$


31.6 External links

• Practical application of likelihood ratio test described

• R Package: Wald’s Sequential Probability Ratio Test

• Richard Lowry’s Predictive Values and Likelihood
Ratios Online Clinical Calculator


Chapter 32

Statistical classification 


For the unsupervised learning approach, see Cluster 
analysis. 

In machine learning and statistics, classification is the
problem of identifying to which of a set of categories
(sub-populations) a new observation belongs, on the basis
of a training set of data containing observations (or
instances) whose category membership is known. An example
would be assigning a given email into “spam” or
“non-spam” classes or assigning a diagnosis to a given
patient as described by observed characteristics of the
patient (gender, blood pressure, presence or absence of
certain symptoms, etc.).

In the terminology of machine learning, [1] classification
is considered an instance of supervised learning, i.e.
learning where a training set of correctly identified
observations is available. The corresponding unsupervised
procedure is known as clustering, and involves grouping data
into categories based on some measure of inherent
similarity or distance.

Often, the individual observations are analyzed into a set
of quantifiable properties, known variously as explanatory
variables, features, etc. These properties may variously
be categorical (e.g. “A”, “B”, “AB” or “O”, for
blood type), ordinal (e.g. “large”, “medium” or “small”),
integer-valued (e.g. the number of occurrences of a
particular word in an email) or real-valued (e.g. a
measurement of blood pressure). Other classifiers work by
comparing observations to previous observations by means of
a similarity or distance function.

An algorithm that implements classification, especially in
a concrete implementation, is known as a classifier. The
term “classifier” sometimes also refers to the mathematical
function, implemented by a classification algorithm,
that maps input data to a category.

Terminology across fields is quite varied. In statistics,
where classification is often done with logistic regression
or a similar procedure, the properties of observations
are termed explanatory variables (or independent
variables, regressors, etc.), and the categories to be
predicted are known as outcomes, which are considered to
be possible values of the dependent variable. In machine
learning, the observations are often known as instances,
the explanatory variables are termed features
(grouped into a feature vector), and the possible
categories to be predicted are classes. There is also some
argument over whether classification methods that do not
involve a statistical model can be considered
“statistical”. Other fields may use different terminology:
e.g. in community ecology, the term “classification”
normally refers to cluster analysis, i.e. a type of
unsupervised learning, rather than the supervised learning
described in this article.


32.1 Relation to other problems 

Classification and clustering are examples of the more
general problem of pattern recognition, which is the
assignment of some sort of output value to a given input
value. Other examples are regression, which assigns
a real-valued output to each input; sequence labeling,
which assigns a class to each member of a sequence of
values (for example, part of speech tagging, which assigns
a part of speech to each word in an input sentence);
parsing, which assigns a parse tree to an input sentence,
describing the syntactic structure of the sentence; etc.

A common subclass of classification is probabilistic
classification. Algorithms of this nature use statistical
inference to find the best class for a given instance.
Unlike other algorithms, which simply output a “best” class,
probabilistic algorithms output a probability of the
instance being a member of each of the possible classes.
The best class is normally then selected as the one with
the highest probability. Such an algorithm has
several advantages over non-probabilistic classifiers:

• It can output a confidence value associated with its
choice (in general, a classifier that can do this is
known as a confidence-weighted classifier).

• Correspondingly, it can abstain when its confidence
of choosing any particular output is too low.

• Because of the probabilities which are generated,
probabilistic classifiers can be more effectively
incorporated into larger machine-learning tasks, in a
way that partially or completely avoids the problem
of error propagation.
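
The first two advantages can be sketched in a few lines.
The example below assumes the class probabilities have
already been produced by some probabilistic model; the
function name classify and the 0.6 confidence threshold are
hypothetical choices for illustration only.

```python
def classify(probabilities, min_confidence=0.6):
    """Pick the most probable class, or abstain (label None) if confidence is too low.

    probabilities: dict mapping class label -> probability (assumed to sum to 1).
    min_confidence: hypothetical threshold chosen for illustration.
    """
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence < min_confidence:
        return None, confidence
    return label, confidence

print(classify({"spam": 0.92, "non-spam": 0.08}))   # ('spam', 0.92)
print(classify({"spam": 0.55, "non-spam": 0.45}))   # (None, 0.55)
```

Returning the confidence alongside the label lets a larger
system weigh the prediction instead of treating it as
certain.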


32.2 Frequentist procedures 

Early work on statistical classification was undertaken by
Fisher, [2][3] in the context of two-group problems, leading
to Fisher’s linear discriminant function as the rule for
assigning a group to a new observation. [4] This early work
assumed that data-values within each of the two groups
had a multivariate normal distribution. The extension
of this same context to more than two groups has also
been considered with a restriction imposed that the
classification rule should be linear. [4][5] Later work for
the multivariate normal distribution allowed the classifier
to be nonlinear: [6] several classification rules can be
derived based on slightly different adjustments of the
Mahalanobis distance, with a new observation being assigned
to the group whose centre has the lowest adjusted distance
from the observation.


32.3 Bayesian procedures 

Unlike frequentist procedures, Bayesian classification
procedures provide a natural way of taking into account
any available information about the relative sizes of
the sub-populations associated with the different groups
within the overall population. [7] Bayesian procedures tend
to be computationally expensive and, in the days before
Markov chain Monte Carlo computations were developed,
approximations for Bayesian clustering rules were
devised. [8]

Some Bayesian procedures involve the calculation of 
group membership probabilities: these can be viewed as 
providing a more informative outcome of a data analysis 
than a simple attribution of a single group-label to each 
new observation. 

32.4 Binary and multiclass classification

Classification can be thought of as two separate problems
- binary classification and multiclass classification. In
binary classification, a better understood task, only two
classes are involved, whereas multiclass classification
involves assigning an object to one of several classes. [9]
Since many classification methods have been developed
specifically for binary classification, multiclass
classification often requires the combined use of multiple
binary classifiers.


32.5 Feature vectors 

Most algorithms describe an individual instance whose
category is to be predicted using a feature vector of
individual, measurable properties of the instance. Each
property is termed a feature, also known in statistics as an
explanatory variable (or independent variable, although in
general different features may or may not be statistically
independent). Features may variously be binary (“male”
or “female”); categorical (e.g. “A”, “B”, “AB” or “O”, for
blood type); ordinal (e.g. “large”, “medium” or “small”);
integer-valued (e.g. the number of occurrences of a
particular word in an email); or real-valued (e.g. a
measurement of blood pressure). If the instance is an image,
the feature values might correspond to the pixels of an
image; if the instance is a piece of text, the feature
values might be occurrence frequencies of different words.
Some algorithms work only in terms of discrete data and
require that real-valued or integer-valued data be
discretized into groups (e.g. less than 5, between 5 and 10,
or greater than 10).

The vector space associated with these vectors is often
called the feature space. In order to reduce the
dimensionality of the feature space, a number of
dimensionality reduction techniques can be employed.

32.6 Linear classifiers 

A large number of algorithms for classification can be
phrased in terms of a linear function that assigns a score
to each possible category k by combining the feature vector
of an instance with a vector of weights, using a dot
product. The predicted category is the one with the highest
score. This type of score function is known as a linear
predictor function and has the following general form:

$$\operatorname{score}(\mathbf{X}_i, k) = \boldsymbol{\beta}_k \cdot \mathbf{X}_i,$$

where $\mathbf{X}_i$ is the feature vector for instance i,
$\boldsymbol{\beta}_k$ is the vector of weights
corresponding to category k, and
$\operatorname{score}(\mathbf{X}_i, k)$ is the score
associated with assigning instance i to category k. In
discrete choice theory, where instances represent people and
categories represent choices, the score is considered the
utility associated with person i choosing category k.

Algorithms with this basic setup are known as linear
classifiers. What distinguishes them is the procedure for
determining (training) the optimal weights/coefficients and
the way that the score is interpreted.

Examples of such algorithms are 

• Logistic regression and Multinomial logistic regression

• Probit regression

• The perceptron algorithm

• Support vector machines 

• Linear discriminant analysis. 
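
The shared prediction step of these algorithms can be
sketched as follows. The weights below are hypothetical
values chosen by hand for illustration; in each of the
algorithms listed above they would instead be learned from
training data by that algorithm’s own procedure. The
function names are invented for this example.

```python
def score(weights_k, x):
    """Linear predictor: dot product of a category's weight vector with the feature vector."""
    return sum(w * xi for w, xi in zip(weights_k, x))

def predict(weights, x):
    """Return the category whose score is highest."""
    return max(weights, key=lambda k: score(weights[k], x))

# Hypothetical weight vectors for three categories over two features.
weights = {
    "A": [1.0, -1.0],
    "B": [0.5, 0.5],
    "C": [-1.0, 2.0],
}
x = [2.0, 1.0]   # feature vector of one instance
scores = {k: score(w, x) for k, w in weights.items()}
print(scores, predict(weights, x))
```

Only the training procedure and the interpretation of the
score differ between methods; the prediction rule itself
stays the same.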

32.7 Algorithms 

Examples of classification algorithms include: 

• Linear classifiers 

• Fisher’s linear discriminant 

• Logistic regression 

• Naive Bayes classifier 

• Perceptron 

• Support vector machines 

• Least squares support vector machines 

• Quadratic classifiers 

• Kernel estimation 

• k-nearest neighbor 

• Boosting (meta-algorithm) 

• Decision trees 

• Random forests 

• Neural networks 

• Learning vector quantization 

32.8 Evaluation 

Classifier performance depends greatly on the
characteristics of the data to be classified. There is no
single classifier that works best on all given problems (a
phenomenon that may be explained by the no-free-lunch
theorem). Various empirical tests have been performed to
compare classifier performance and to find the
characteristics of data that determine classifier
performance. Determining a suitable classifier for a given
problem is however still more an art than a science.

The measures precision and recall are popular metrics 
used to evaluate the quality of a classification system. 
More recently, receiver operating characteristic (ROC) 
curves have been used to evaluate the tradeoff between 
true- and false-positive rates of classification algorithms. 

As a performance metric, the uncertainty coefficient has
the advantage over simple accuracy in that it is not
affected by the relative sizes of the different classes. [10]
Further, it will not penalize an algorithm for simply
rearranging the classes.


32.9 Application domains 

See also: Cluster analysis § Applications 

Classification has many applications. In some of these it 
is employed as a data mining procedure, while in others 
more detailed statistical modeling is undertaken. 

• Computer vision 

• Medical imaging and medical image analysis 

• Optical character recognition 

• Video tracking 

• Drug discovery and development 

• Toxicogenomics 

• Quantitative structure-activity relationship 

• Geostatistics 

• Speech recognition 

• Handwriting recognition 

• Biometric identification 

• Biological classification 

• Statistical natural language processing 

• Document classification 

• Internet search engines 

• Credit scoring 

• Pattern recognition 

• Micro-array classification 

32.10 See also 

• Class membership probabilities 

• Classification rule 

• Binary classification 

• Compound term processing 

• Data mining 

• Fuzzy logic 

• Data warehouse 

• Information retrieval 

• Artificial intelligence 

• Machine learning 

• Recommender system

32.12 External links

• Classifier showdown: a practical comparison of
classification algorithms.

• Statistical Pattern Recognition Toolbox for Matlab. 

• TOOLDIAG Pattern recognition toolbox. 

• Statistical classification software based on adaptive 
kernel density estimation. 

• PAL Classification Suite written in Java. 

• kNN and Potential energy (Applet), University of 
Leicester 

• scikit-learn: a widely used package in Python.

• Weka: a Java-based package with an extensive variety
of algorithms.


Chapter 33 

Binary classification 


Binary or binomial classification is the task of
classifying the elements of a given set into two groups
on the basis of a classification rule. Some typical binary
classification tasks are:

• Medical testing to determine if a patient has a certain
disease or not - the classification property is the
presence of the disease.

• A “pass or fail” test method or quality control in
factories, i.e. deciding if a specification has or has not
been met: a go/no go classification.

• An item may have a qualitative property: it does or
does not have a specified characteristic.

• Information retrieval, namely deciding whether a
page or an article should be in the result set of a
search or not - the classification property is the
relevance of the article, or the usefulness to the user.

An important point is that in many practical binary
classification problems, the two groups are not symmetric -
rather than overall accuracy, the relative proportion of
different types of errors is of interest. For example, in
medical testing, a false positive (detecting a disease when
it is not present) is considered differently from a false
negative (not detecting a disease when it is present).

Statistical classification in general is one of the
problems studied in computer science, in order to
automatically learn classification systems; some methods
suitable for learning binary classifiers include decision
trees, Bayesian networks, support vector machines, neural
networks, probit regression, and logistic regression.

Sometimes, classification tasks are trivial. Given 100
balls, some of them red and some blue, a human with
normal color vision can easily separate them into red ones
and blue ones. However, some tasks, like those in practical
medicine, and those interesting from the computer
science point of view, are far from trivial, and may
produce faulty results if executed imprecisely.


33.1 Evaluation of binary classifiers

Main article: Evaluation of binary classifiers

There are many metrics that can be used to measure the
performance of a classifier or predictor; different fields
have different preferences for specific metrics due to
different goals. For example, in medicine sensitivity and
specificity are often used, while in information retrieval
precision and recall are preferred. An important
distinction is between metrics that are independent of the
prevalence (how often each category occurs in the
population), and metrics that depend on the prevalence -
both types are useful, but they have very different
properties.

Given a classification of a specific data set, there are four
basic data: the number of true positives (TP), true
negatives (TN), false positives (FP), and false negatives
(FN). These can be arranged into a 2x2 contingency table,
with columns corresponding to actual value - condition
positive (CP) or condition negative (CN) - and rows
corresponding to classification value - test outcome
positive or test outcome negative. There are eight basic
ratios that one can compute from this table, which come
in four complementary pairs (each pair summing to 1).
These are obtained by dividing each of the four numbers
by the sum of its row or column, yielding eight numbers,
which can be referred to generically in the form “true
positive row ratio” or “false negative column ratio”, though
there are conventional terms. There are thus two pairs
of column ratios and two pairs of row ratios, and one can
summarize these with four numbers by choosing one ratio
from each pair - the other four numbers are the
complements.

The column ratios are True Positive Rate (TPR, aka
Sensitivity or recall), with complement the False Negative
Rate (FNR); and True Negative Rate (TNR, aka
Specificity, SPC), with complement False Positive Rate
(FPR). These are the proportion of the population with
the condition (resp., without the condition) for which the
test is correct (or, complementarily, for which the test is
incorrect); these are independent of prevalence.

The row ratios are Positive Predictive Value (PPV, aka
precision), with complement the False Discovery Rate
(FDR); and Negative Predictive Value (NPV), with
complement the False Omission Rate (FOR). These are the
proportion of the population with a given test result for
which the test is correct (or, complementarily, for which
the test is incorrect); these depend on prevalence.

In diagnostic testing, the main ratios used are the true
column ratios - True Positive Rate and True Negative Rate -
where they are known as sensitivity and specificity. In
information retrieval, the main ratios are the true positive
ratios (row and column) - Positive Predictive Value and
True Positive Rate - where they are known as precision
and recall.

One can take ratios of a complementary pair of ratios,
yielding four likelihood ratios (two column ratios of
ratios, two row ratios of ratios). This is primarily done
for the column (condition) ratios, yielding likelihood
ratios in diagnostic testing. Taking the ratio of one of
these groups of ratios yields a final ratio, the diagnostic
odds ratio (DOR). This can also be defined directly as
(TP×TN)/(FP×FN) = (TP/FN)/(FP/TN); this has a useful
interpretation - as an odds ratio - and is
prevalence-independent.
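
The ratios described in this section can be computed
directly from the four counts. The sketch below uses
hypothetical counts and an invented function name; the
labels LR+ and LR- for the two column likelihood ratios
follow common diagnostic-testing usage rather than anything
defined above, and the code assumes no row or column of the
table is empty (otherwise a division by zero occurs).

```python
def ratios(tp, fp, fn, tn):
    """Basic ratios from a 2x2 contingency table of counts."""
    tpr = tp / (tp + fn)   # sensitivity / recall (column ratio)
    tnr = tn / (tn + fp)   # specificity (column ratio)
    ppv = tp / (tp + fp)   # precision (row ratio)
    npv = tn / (tn + fn)   # negative predictive value (row ratio)
    return {
        "TPR": tpr, "FNR": 1 - tpr,
        "TNR": tnr, "FPR": 1 - tnr,
        "PPV": ppv, "FDR": 1 - ppv,
        "NPV": npv, "FOR": 1 - npv,
        "LR+": tpr / (1 - tnr),         # positive likelihood ratio, TPR / FPR
        "LR-": (1 - tpr) / tnr,         # negative likelihood ratio, FNR / TNR
        "DOR": (tp * tn) / (fp * fn),   # diagnostic odds ratio
    }

r = ratios(tp=90, fp=30, fn=10, tn=170)   # hypothetical counts
for name, value in r.items():
    print(f"{name}: {value:.4f}")
```

Dividing LR+ by LR- reproduces the diagnostic odds ratio,
which is one way to check that the two definitions of the
DOR agree.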

There are a number of other metrics, most simply the
accuracy or Fraction Correct (FC), which measures the
fraction of all instances that are correctly categorized;
the complement is the Fraction Incorrect (FiC). The
F-score combines precision and recall into one number via
a choice of weighting, most simply equal weighting, as
the balanced F-score (F1 score). Some metrics come
from regression coefficients: the markedness and the
informedness, and their geometric mean, the Matthews
correlation coefficient. Other metrics include Youden’s J
statistic, the uncertainty coefficient, the Phi coefficient,
and Cohen’s kappa.
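
A few of these summary metrics are sketched below for
hypothetical counts. The function name is invented for this
example, and the formulas used are the standard textbook
ones (F1 as the harmonic mean of precision and recall,
informedness as TPR + TNR - 1, markedness as PPV + NPV - 1),
stated here as assumptions since the text above does not
spell them out.

```python
import math

def summary_metrics(tp, fp, fn, tn):
    """Return accuracy, F1, informedness, markedness and Matthews correlation."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / total
    f1 = 2 * precision * recall / (precision + recall)
    informedness = recall + tn / (tn + fp) - 1    # TPR + TNR - 1
    markedness = precision + tn / (tn + fn) - 1   # PPV + NPV - 1
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return accuracy, f1, informedness, markedness, mcc

accuracy, f1, informedness, markedness, mcc = summary_metrics(90, 30, 10, 170)
print(accuracy, f1, informedness, markedness, mcc)
```

For these counts the Matthews correlation coefficient equals
the geometric mean of informedness and markedness, as the
text states.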


33.2 Converting continuous values 
to binary 

Tests whose results are of continuous values, such as most 
blood values, can artificially be made binary by defining a 
cutoff value, with test results being designated as positive 
or negative depending on whether the resultant value is 
higher or lower than the cutoff. 

However, such conversion causes a loss of information, as 
the resultant binary classification does not tell how much 
above or below the cutoff a value is. As a result, when 
converting a continuous value that is close to the cutoff to 
a binary one, the resultant positive or negative predictive 
value is generally higher than the predictive value given 
directly from the continuous value. In such cases, the 
designation of the test of being either positive or negative 
gives the appearance of an inappropriately high certainty, 
while the value is in fact in an interval of uncertainty. For
example, with the urine concentration of hCG as a
continuous value, a urine pregnancy test that measured 52
mIU/ml of hCG may show as “positive” with 50 mIU/ml
as cutoff, but is in fact in an interval of uncertainty, which
may be apparent only by knowing the original continuous
value. On the other hand, a test result very far from the
cutoff generally has a resultant positive or negative
predictive value that is lower than the predictive value given
from the continuous value. For example, a urine hCG
value of 200,000 mIU/ml confers a very high probability
of pregnancy, but conversion to binary values makes it
show as just as “positive” as the one of 52 mIU/ml.
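
The loss of information can be seen in a few lines of code. This is only a sketch: the function name is hypothetical, and treating a value exactly equal to the cutoff as negative is an assumption, since the text does not say how ties are handled.

```python
def dichotomize(value, cutoff):
    """Label a continuous test result as positive or negative."""
    # Assumption: a value exactly at the cutoff counts as negative.
    return "positive" if value > cutoff else "negative"


CUTOFF_MIU_PER_ML = 50
readings = [48, 52, 200_000]  # hCG values in mIU/ml
labels = [dichotomize(v, CUTOFF_MIU_PER_ML) for v in readings]
print(labels)
```

The readings 52 and 200,000 receive the same label, although the first lies just above the cutoff and the second lies far from it.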

33.3 See also 

• Examples of Bayesian inference 

• Classification rule 

• Detection theory 

• Kernel methods 

• Matthews correlation coefficient 

• Multiclass classification 

• Multi-label classification 

• One-class classification 

• Prosecutor’s fallacy 

• Receiver operating characteristic 

• Thresholding (image processing) 

• Type I and type II errors 

• Uncertainty coefficient, aka Proficiency 

• Qualitative property 
Chapter 34

Maximum likelihood 


This article is about the statistical techniques. For 
computer data storage, see Partial response maximum 
likelihood. 

In statistics, maximum-likelihood estimation (MLE)
is a method of estimating the parameters of a statistical
model. When applied to a data set and given a
statistical model, maximum-likelihood estimation
provides estimates for the model’s parameters.

The method of maximum likelihood corresponds to many 
well-known estimation methods in statistics. For exam¬ 
ple, one may be interested in the heights of adult fe¬ 
male penguins, but be unable to measure the height of 
every single penguin in a population due to cost or time 
constraints. Assuming that the heights are normally dis¬ 
tributed with some unknown mean and variance, the 
mean and variance can be estimated with MLE while only 
knowing the heights of some sample of the overall pop¬ 
ulation. MLE would accomplish this by taking the mean 
and variance as parameters and finding particular para¬ 
metric values that make the observed results the most 
probable given the model. 
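
For the normal model the maximizing values have a closed form: the sample mean, and the average squared deviation from it (dividing by n, not n - 1). A minimal sketch follows; the penguin heights are invented for illustration and the function name is hypothetical.

```python
import statistics


def normal_mle(sample):
    """Return the maximum-likelihood (mean, variance) for a normal model."""
    mu_hat = statistics.fmean(sample)
    # pvariance divides by n, which is the maximum-likelihood form.
    var_hat = statistics.pvariance(sample, mu=mu_hat)
    return mu_hat, var_hat


# Hypothetical heights in centimetres (not real measurements).
heights_cm = [66.0, 70.5, 68.2, 71.1, 69.4]
mu_hat, var_hat = normal_mle(heights_cm)
print(mu_hat, var_hat)
```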

In general, for a fixed set of data and underlying sta¬ 
tistical model, the method of maximum likelihood se¬ 
lects the set of values of the model parameters that maxi¬ 
mizes the likelihood function. Intuitively, this maximizes 
the “agreement” of the selected model with the observed 
data, and for discrete random variables it indeed max¬ 
imizes the probability of the observed data under the 
resulting distribution. Maximum-likelihood estimation 
gives a unified approach to estimation, which is well- 
defined in the case of the normal distribution and many 
other problems. However, in some complicated prob¬ 
lems, difficulties do occur: in such problems, maximum- 
likelihood estimators are unsuitable or do not exist. 


34.1 Principles 

Suppose there is a sample $x_1, x_2, \ldots, x_n$ of $n$
independent and identically distributed observations, coming
from a distribution with an unknown probability density
function $f_0(\cdot)$. It is however surmised that the function
$f_0$ belongs to a certain family of distributions
$\{ f(\cdot \mid \theta),\ \theta \in \Theta \}$ (where $\theta$
is a vector of parameters for this family), called the
parametric model, so that $f_0 = f(\cdot \mid \theta_0)$. The
value $\theta_0$ is unknown and is referred to as the true value
of the parameter vector. It is desirable to find an
estimator $\hat\theta$ which would be as close to the true value
$\theta_0$ as possible. Either or both the observed variables
$x_i$ and the parameter $\theta$ can be vectors.

To use the method of maximum likelihood, one first spec¬ 
ifies the joint density function for all observations. For an 
independent and identically distributed sample, this joint 
density function is 


$$f(x_1, x_2, \ldots, x_n \mid \theta) = f(x_1 \mid \theta) \times f(x_2 \mid \theta) \times \cdots \times f(x_n \mid \theta).$$

Now we look at this function from a different perspective
by considering the observed values $x_1, x_2, \ldots, x_n$ to be
fixed “parameters” of this function, whereas $\theta$ will be
the function’s variable and allowed to vary freely; this
function will be called the likelihood:

$$\mathcal{L}(\theta\,;\,x_1, \ldots, x_n) = f(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta).$$

Note that “;” denotes a separation between the two input
arguments: $\theta$ and the vector-valued input $x_1, \ldots, x_n$.

In practice it is often more convenient to work with 
the logarithm of the likelihood function, called the log- 
likelihood: 


$$\ln \mathcal{L}(\theta\,;\,x_1, \ldots, x_n) = \sum_{i=1}^{n} \ln f(x_i \mid \theta),$$

or the average log-likelihood:

$$\hat\ell = \frac{1}{n} \ln \mathcal{L}.$$

The hat over $\ell$ indicates that it is akin to some estimator.
Indeed, $\hat\ell$ estimates the expected log-likelihood of a single
observation in the model.
The method of maximum likelihood estimates $\theta_0$ by
finding a value of $\theta$ that maximizes $\hat\ell(\theta\,;\,x)$.
This method of estimation defines a maximum-likelihood
estimator (MLE) of $\theta_0$ ...
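
The definitions above can be sketched for a simple Bernoulli model, in which each observation is 1 with probability p and 0 otherwise. This is an illustrative sketch only: the sample is made up, the function names are hypothetical, and the crude grid search stands in for a proper optimizer.

```python
import math


def log_likelihood(p, xs):
    """Log-likelihood of Bernoulli parameter p for 0/1 observations xs."""
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in xs)


def average_log_likelihood(p, xs):
    """The average log-likelihood: the log-likelihood divided by n."""
    return log_likelihood(p, xs) / len(xs)


# Hypothetical sample: 7 ones out of 10 observations.
xs = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

# Candidate parameter values 0.01, 0.02, ..., 0.99.
grid = [k / 100 for k in range(1, 100)]
p_hat = max(grid, key=lambda p: average_log_likelihood(p, xs))
print(p_hat)
```

Maximizing the likelihood, the log-likelihood, or the average log-likelihood over the grid selects the same value, since each is an increasing transformation of the others.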


34.2 Properties 

A maximum-likelihood estimator is an extremum estima¬ 
tor obtained by maximizing, as a function of 0, the objec¬ 
tive function (c.f., the loss function) 


$$\{ \hat\theta_{\mathrm{mle}} \} \subseteq \{ \underset{\theta \in \Theta}{\arg\max}\ \hat\ell(\theta\,;\,x_1, \ldots, x_n) \}$$

... if any maximum exists. An MLE estimate is the same
regardless of whether we maximize the likelihood or the
log-likelihood function, since log is a strictly
monotonically increasing function.

For many models, a maximum likelihood estimator can
be found as an explicit function of the observed data
$x_1, \ldots, x_n$. For many other models, however, no
closed-form solution to the maximization problem is known or
available, and an MLE has to be found numerically using
optimization methods. For some problems, there may
be multiple estimates that maximize the likelihood. For
other problems, no maximum likelihood estimate exists
(meaning that the log-likelihood function increases
without attaining the supremum value).
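
As one illustration of a numerically computed MLE, consider the location parameter of a logistic distribution with scale fixed at 1, for which no closed-form solution is available. The choice of model, the bisection method and all names below are assumptions made for this sketch. The derivative of the log-likelihood (the score) is strictly decreasing in the location, positive at the smallest observation and negative at the largest, so bisection on that interval finds its root.

```python
import math


def score(mu, xs):
    """Derivative of the logistic(mu, scale=1) log-likelihood in mu."""
    return sum(1.0 - 2.0 / (1.0 + math.exp(x - mu)) for x in xs)


def logistic_location_mle(xs, tol=1e-10):
    """Find the root of the score by bisection on [min(xs), max(xs)]."""
    lo, hi = min(xs), max(xs)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid, xs) > 0.0:
            lo = mid  # log-likelihood still increasing: move right
        else:
            hi = mid  # log-likelihood decreasing: move left
    return 0.5 * (lo + hi)


# Hypothetical symmetric sample; by symmetry the estimate is near 0.
print(logistic_location_mle([-1.0, 0.0, 1.0]))
```

In practice a general-purpose optimizer would usually be used instead; bisection is chosen here only because it needs nothing beyond the standard library.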

In the exposition above, it is assumed that the data are
independent and identically distributed. The method can
be applied however to a broader setting, as long as it is
possible to write the joint density function
$f(x_1, \ldots, x_n \mid \theta)$, and its parameter $\theta$ has a
finite dimension which does not depend on the sample size $n$.
In a simpler extension, an allowance can be made for data
heterogeneity, so that the joint density is equal to
$f_1(x_1 \mid \theta) \cdot f_2(x_2 \mid \theta) \cdots f_n(x_n \mid \theta)$.
Put another way, we are now assuming that each
observation $x_i$ comes from a random variable that has its
own distribution function $f_i$. In the more complicated
case of time series models, the independence assumption
may have to be dropped as well.

A maximum likelihood estimator coincides with the most
probable Bayesian estimator given a uniform prior
distribution on the parameters. Indeed, the maximum a
posteriori estimate is the parameter $\theta$ that maximizes the
probability of $\theta$ given the data, given by Bayes’ theorem:

$$P(\theta \mid x_1, x_2, \ldots, x_n) = \frac{f(x_1, x_2, \ldots, x_n \mid \theta)\, P(\theta)}{P(x_1, x_2, \ldots, x_n)}$$

where $P(\theta)$ is the prior distribution for the parameter
$\theta$ and where $P(x_1, x_2, \ldots, x_n)$ is the probability of
the data averaged over all parameters. Since the denominator
is independent of $\theta$, the Bayesian estimator is obtained by
maximizing $f(x_1, x_2, \ldots, x_n \mid \theta)\, P(\theta)$ with
respect to $\theta$. If we further assume that the prior
$P(\theta)$ is a uniform distribution, the Bayesian estimator is
obtained by maximizing the likelihood function
$f(x_1, x_2, \ldots, x_n \mid \theta)$. Thus the Bayesian
estimator coincides with the maximum-likelihood estimator
for a uniform prior distribution $P(\theta)$.
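
The coincidence with a uniform prior can be illustrated on a grid of parameter values for a coin-tossing model. This is a sketch under stated assumptions: the counts, the grid, the function names and the non-uniform prior (proportional to p(1 - p)) are all chosen for illustration.

```python
import math


def log_likelihood(p, heads, tails):
    """Binomial log-likelihood up to an additive constant."""
    return heads * math.log(p) + tails * math.log(1.0 - p)


def argmax_posterior(log_prior, heads, tails, grid):
    """Grid value maximizing log-likelihood plus log-prior."""
    return max(grid, key=lambda p: log_likelihood(p, heads, tails) + log_prior(p))


grid = [k / 100 for k in range(1, 100)]
heads, tails = 7, 3  # hypothetical data

mle = max(grid, key=lambda p: log_likelihood(p, heads, tails))
# Uniform prior: the log-prior is constant, so it cannot move the maximum.
map_uniform = argmax_posterior(lambda p: 0.0, heads, tails, grid)
# Non-uniform prior proportional to p * (1 - p): the maximum shifts.
map_informative = argmax_posterior(
    lambda p: math.log(p * (1.0 - p)), heads, tails, grid
)
print(mle, map_uniform, map_informative)
```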


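The objective function of the maximum-likelihood estimator
is the average log-likelihood

$$\hat\ell(\theta \mid x) = \frac{1}{n} \sum_{i=1}^{n} \ln f(x_i \mid \theta),$$

this being the sample analogue of the expected
log-likelihood $\ell(\theta) = \operatorname{E}[\,\ln f(x_i \mid \theta)\,]$,
where this expectation is taken with respect to the true
density $f(\cdot \mid \theta_0)$.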


Maximum-likelihood estimators have no optimum
properties for finite samples, in the sense that (when evaluated
on finite samples) other estimators may have greater
concentration around the true parameter-value. [1] However,
like other estimation methods, maximum-likelihood
estimation possesses a number of attractive limiting
properties. As the sample size increases to infinity, sequences
of maximum-likelihood estimators have these properties:

• Consistency: the sequence of MLEs converges in 
probability to the value being estimated. 

• Asymptotic normality: as the sample size increases,
the distribution of the MLE tends to the Gaussian
distribution with mean $\theta$ and covariance matrix
equal to the inverse of the Fisher information
matrix.

• Efficiency, i.e., it achieves the Cramér–Rao lower
bound when the sample size tends to infinity. This
means that no consistent estimator has lower
asymptotic mean squared error than the MLE (or other
estimators attaining this bound).

• Second-order efficiency after correction for bias. 


34.2.1 Consistency 

Under the conditions outlined below, the maximum
likelihood estimator is consistent. Consistency means
that, given a sufficiently large number of observations $n$,
it is possible to find the value of $\theta_0$ with arbitrary
precision. In mathematical terms this means that as $n$ goes to
infinity the estimator $\hat\theta$ converges in probability to
its true value:

$$\hat\theta_{\mathrm{mle}} \xrightarrow{\ p\ } \theta_0.$$

Under slightly stronger conditions, the estimator
converges almost surely (or strongly) to:

$$\hat\theta_{\mathrm{mle}} \xrightarrow{\ \text{a.s.}\ } \theta_0.$$
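
Consistency can be illustrated by simulation. The sketch below uses an exponential model, whose rate parameter has the closed-form MLE n divided by the sum of the observations; the true rate, the seed and the sample sizes are arbitrary choices made for this example.

```python
import random


def exponential_rate_mle(sample):
    """Closed-form MLE of the rate of an exponential distribution."""
    return len(sample) / sum(sample)


random.seed(0)  # fixed seed so that the run is repeatable
TRUE_RATE = 2.0  # hypothetical true parameter value

errors = {}
for n in (10, 1_000, 100_000):
    sample = [random.expovariate(TRUE_RATE) for _ in range(n)]
    errors[n] = abs(exponential_rate_mle(sample) - TRUE_RATE)

for n, err in errors.items():
    print(n, err)
```

A single run only suggests the behaviour; convergence in probability is a statement about the distribution of the estimator, not about any one sequence of samples.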
To establish consistency, the following conditions are
sufficient: [2]

1. Identification of the model:

$$\theta \neq \theta_0 \quad \Leftrightarrow \quad f(\cdot \mid \theta) \neq f(\cdot \mid \theta_0).$$

In other words, different parameter values $\theta$
correspond to different distributions within the model.
If this condition did not hold, there would be some
value $\theta_1$ such that $\theta_0$ and $\theta_1$ generate an
identical distribution of the observable data. Then we
wouldn’t be able to distinguish between these two
parameters even with an infinite amount of data;
these parameters would have been observationally
equivalent.

The identification condition is absolutely necessary
for the ML estimator to be consistent. When this
condition holds, the limiting likelihood function
$\ell(\theta \mid \cdot)$ has unique global maximum at $\theta_0$.

2. Compactness: the parameter space $\Theta$ of the model
is compact. The identification condition establishes
that the log-likelihood has a unique global
maximum. Compactness implies that the likelihood
cannot approach the maximum value arbitrarily close
at some other point (as demonstrated for example in
the picture on the right).

Compactness is only a sufficient condition and not a
necessary condition. Compactness can be replaced
by some other conditions, such as:

• both concavity of the log-likelihood function
and compactness of some (nonempty) upper
level sets of the log-likelihood function, or

• existence of a compact neighborhood $N$ of $\theta_0$
such that outside of $N$ the log-likelihood
function is less than the maximum by at least some
$\varepsilon > 0$.

3. Continuity: the function $\ln f(x \mid \theta)$ is continuous
in $\theta$ for almost all values of $x$:

$$\Pr\!\big[\, \ln f(x \mid \theta) \in C^0(\Theta) \,\big] = 1.$$

The continuity here can be replaced with a slightly
weaker condition of upper semi-continuity.


4. Dominance: there exists $D(x)$ integrable with
respect to the distribution $f(x \mid \theta_0)$ such that

$$|\ln f(x \mid \theta)| < D(x) \quad \text{for all } \theta \in \Theta.$$

By the uniform law of large numbers, the
dominance condition together with continuity establish
the uniform convergence in probability of the
log-likelihood:

$$\sup_{\theta \in \Theta} \big|\, \hat\ell(\theta \mid x) - \ell(\theta) \,\big| \xrightarrow{\ p\ } 0.$$

The dominance condition can be employed in the case
of i.i.d. observations. In the non-i.i.d. case the uniform
convergence in probability can be checked by showing
that the sequence $\hat\ell(\theta \mid x)$ is stochastically
equicontinuous.

If one wants to demonstrate that the ML estimator
$\hat\theta$ converges to $\theta_0$ almost surely, then a stronger
condition of uniform convergence almost surely has to be
imposed:

$$\sup_{\theta \in \Theta} \big\|\, \hat\ell(\theta \mid x) - \ell(\theta) \,\big\| \xrightarrow{\ \text{a.s.}\ } 0.$$

34.2.2 Asymptotic normality 

Maximum-likelihood estimators can lack asymptotic 
normality and can be inconsistent if there is a failure of 
one (or more) of the below regularity conditions: 

Estimate on boundary. Sometimes the maximum like¬ 
lihood estimate lies on the boundary of the set of possible 
parameters, or (if the boundary is not, strictly speaking, 
allowed) the likelihood gets larger and larger as the pa¬ 
rameter approaches the boundary. Standard asymptotic 
theory needs the assumption that the true parameter value 
lies away from the boundary. If we have enough data, 
the maximum likelihood estimate will keep away from 
the boundary too. But with smaller samples, the estimate 
can lie on the boundary. In such cases, the asymptotic
theory clearly does not give a practically useful
approximation. Examples here would be variance-component
models, where each component of variance, $\sigma^2$, must
satisfy the constraint $\sigma^2 \ge 0$.

Data boundary parameter-dependent. For the theory
to apply in a simple way, the set of data values which
has positive probability (or positive probability density)
should not depend on the unknown parameter. A simple
example where such parameter-dependence does hold
is the case of estimating $\theta$ from a set of independent
identically distributed observations when the common
distribution is uniform on the range $(0, \theta)$. For
estimation purposes the relevant range of $\theta$ is such that
$\theta$ cannot be less than the largest observation. Because
the interval $(0, \theta)$ is not compact, there exists no
maximum for the likelihood function: for any admissible
estimate of $\theta$, there exists a smaller admissible estimate,
closer to the largest observation, that has greater
likelihood. In contrast, the interval $[0, \theta]$ includes the
end-point $\theta$ and is compact, in which case the
maximum-likelihood estimator exists. However, in this case,
the maximum-likelihood estimator is biased. Asymptotically,
this maximum-likelihood estimator is not normally
distributed. [3]
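
For the closed interval the estimator is simply the largest observation, and its bias follows from the standard result that the expected maximum of n independent uniform draws on [0, θ] is nθ/(n + 1). A minimal sketch, with hypothetical function names and an invented sample:

```python
def uniform_upper_mle(sample):
    """MLE of theta for data uniform on [0, theta]: the largest value."""
    return max(sample)


def expected_uniform_upper_mle(theta, n):
    """Expected value of the largest of n uniform draws on [0, theta]."""
    return n * theta / (n + 1)


# Hypothetical sample, for illustration only.
print(uniform_upper_mle([0.2, 1.7, 0.9]))
# With theta = 2 and n = 3 the estimator averages 1.5, below theta.
print(expected_uniform_upper_mle(2.0, 3))
```

The expected value is always below θ for finite n, which is the bias mentioned above; it shrinks as n grows.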

Nuisance parameters. For maximum likelihood esti¬ 
mations, a model may have a number of nuisance param¬ 
eters. For the asymptotic behaviour outlined to hold, the 
number of nuisance parameters should not increase with 
the number of observations (the sample size). A well- 
known example of this case is where observations occur 
as pairs, where the observations in each pair have a dif¬ 
ferent (unknown) mean but otherwise the observations 
are independent and normally distributed with a common 
variance. Here for 2N observations, there are N + 1
parameters. It is well known that the maximum likelihood
estimate for the variance does not converge to the true
value of the variance; it converges instead to half of
that value.
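
The paired-observation example can be simulated directly. In the sketch below each pair has its own mean, estimated by the pair average, and the variance estimate pools the squared deviations over all 2N observations; the seed, the number of pairs, the range of the means and the function name are assumptions made for this illustration.

```python
import random


def paired_variance_mle(pairs):
    """MLE of the common variance when every pair has its own mean."""
    total = 0.0
    for a, b in pairs:
        pair_mean = 0.5 * (a + b)  # MLE of this pair's mean
        total += (a - pair_mean) ** 2 + (b - pair_mean) ** 2
    return total / (2 * len(pairs))  # divide by the 2N observations


random.seed(1)
TRUE_SIGMA = 1.0  # hypothetical true standard deviation

pairs = []
for _ in range(20_000):
    mean_i = random.uniform(-10.0, 10.0)  # a different mean per pair
    pairs.append(
        (random.gauss(mean_i, TRUE_SIGMA), random.gauss(mean_i, TRUE_SIGMA))
    )

estimate = paired_variance_mle(pairs)
print(estimate)  # stays near half the true variance, not near 1
```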

Increasing information. For the asymptotics to hold 
in cases where the assumption of independent identically 
distributed observations does not hold, a basic require¬ 
ment is that the amount of information in the data in¬ 
creases indefinitely as the sample size increases. Such a 
requirement may not be met if either there is too much 
dependence in the data (for example, if new observations 
are essentially identical to existing observations), or if 
new independent observations are subject to an increas¬ 
ing observation error. 

Some regularity conditions which ensure this behavior 
are: 

1. The first and second derivatives of the log-likelihood 
function must be defined. 

2. The Fisher information matrix must not be zero, and 
must be continuous as a function of the parameter. 

3. The maximum likelihood estimator is consistent. 

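Suppose that conditions for consistency of maximum
likelihood estimator are satisfied, and [4]

1. $\theta_0 \in \operatorname{interior}(\Theta)$;

2. $f(x \mid \theta) > 0$ and is twice continuously differentiable
in $\theta$ in some neighborhood $N$ of $\theta_0$;

3. $\int \sup_{\theta \in N} \| \nabla_\theta f(x \mid \theta) \| \, dx < \infty$, and
$\int \sup_{\theta \in N} \| \nabla_{\theta\theta} f(x \mid \theta) \| \, dx < \infty$;

4. $I = \operatorname{E}[\, \nabla_\theta \ln f(x \mid \theta_0) \, \nabla_\theta \ln f(x \mid \theta_0)' \,]$
exists and is nonsingular;

5. $\operatorname{E}[\, \sup_{\theta \in N} \| \nabla_{\theta\theta} \ln f(x \mid \theta) \| \,] < \infty$.

Then the maximum likelihood estimator has
asymptotically normal distribution:

$$\sqrt{n}\, (\hat\theta_{\mathrm{mle}} - \theta_0) \xrightarrow{\ d\ } \mathcal{N}(0, I^{-1}).$$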

Proof, skipping the technicalities : 

Since the log-likelihood function is differentiable, and
$\theta_0$ lies in the interior of the parameter set, in the
maximum the first-order condition will be satisfied:

$$\nabla_\theta\, \hat\ell(\hat\theta \mid x) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \ln f(x_i \mid \hat\theta) = 0.$$

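When the log-likelihood is twice differentiable, this
expression can be expanded into a Taylor series around the
point $\theta = \theta_0$:

$$0 = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \ln f(x_i \mid \theta_0) + \left[ \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta\theta} \ln f(x_i \mid \tilde\theta) \right] (\hat\theta - \theta_0),$$

where $\tilde\theta$ is some point intermediate between $\theta_0$
and $\hat\theta$. From this expression we can derive that

$$\sqrt{n}\, (\hat\theta - \theta_0) = \left[ -\frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta\theta} \ln f(x_i \mid \tilde\theta) \right]^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \nabla_\theta \ln f(x_i \mid \theta_0).$$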

Here the expression in square brackets converges in
probability to $H = \operatorname{E}[\, -\nabla_{\theta\theta} \ln f(x \mid \theta_0) \,]$
by the law of large numbers. The continuous mapping
theorem ensures that the inverse of this expression also
converges in probability, to $H^{-1}$. The second sum, by the
central limit theorem, converges in distribution to a
multivariate normal with mean zero and variance matrix
equal to the Fisher information $I$. Thus, applying
Slutsky’s theorem to the whole expression, we obtain that

$$\sqrt{n}\, (\hat\theta - \theta_0) \xrightarrow{\ d\ } \mathcal{N}\big( 0,\ H^{-1} I H^{-1} \big).$$

Finally, the information equality guarantees that when the
model is correctly specified, matrix $H$ will be equal to
the Fisher information $I$, so that the variance expression
simplifies to just $I^{-1}$.
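
The variance in this result can be checked in a case where everything is computable exactly. For n Bernoulli trials with success probability p, the MLE is the observed proportion and the Fisher information of one trial is 1/(p(1 - p)). The sketch below, with hypothetical function names and arbitrary values of n and p, sums over the binomial distribution to obtain the exact variance of the scaled estimation error.

```python
import math


def fisher_information(p):
    """Fisher information of a single Bernoulli(p) observation."""
    return 1.0 / (p * (1.0 - p))


def scaled_mle_variance(n, p):
    """Exact variance of sqrt(n) * (k/n - p) with k ~ Binomial(n, p)."""
    var = 0.0
    for k in range(n + 1):
        pmf = math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        # The MLE k/n has mean p, so this sums squared deviations.
        var += pmf * n * (k / n - p) ** 2
    return var


n, p = 50, 0.3  # arbitrary illustrative values
print(scaled_mle_variance(n, p), 1.0 / fisher_information(p))
```

In this special case the variance equals the inverse Fisher information for every n; in general the equality, like the normal shape of the distribution, holds only in the limit.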


34.2.3 Functional invariance 

The maximum likelihood estimator selects the parame¬ 
ter value which gives the observed data the largest possi¬ 
ble probability (or probability density, in the continuous 
case). If the parameter consists of a number of compo¬ 
nents, then we define their separate maximum likelihood 
estimators, as the corresponding component of the MLE 
of the complete parameter. Consistent with this, if
$\hat\theta$ is the MLE for $\theta$, and if $g(\theta)$ is any
transformation of $\theta$, then the MLE for $\alpha = g(\theta)$
is by definition

$$\hat\alpha = g(\hat\theta).$$

It maximizes the so-called profile likelihood:

$$\bar{L}(\alpha) = \sup_{\theta :\, \alpha = g(\theta)} L(\theta).$$

The MLE is also invariant with respect to certain
transformations of the data. If $Y = g(X)$ where $g$ is one to one
and does not depend on the parameters to be estimated,
then the density functions satisfy

$$f_Y(y) = f_X(x) / |g'(x)|$$

and hence the likelihood functions for X and Y differ only 
by a factor that does not depend on the model parameters. 

For example, the MLE parameters of the log-normal dis¬ 
tribution are the same as those of the normal distribution 
fitted to the logarithm of the data. 
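
The log-normal example can be sketched as follows: the normal MLE is computed on the logarithms of the data, and the result is then checked against nearby parameter values using the log-normal log-likelihood itself. The sample and the function names are hypothetical, and the final line uses the functional invariance described above to obtain the estimate of the standard deviation.

```python
import math
import statistics


def lognormal_mle(sample):
    """Fit a normal distribution to log(x) by maximum likelihood."""
    logs = [math.log(x) for x in sample]
    mu_hat = statistics.fmean(logs)
    var_hat = statistics.pvariance(logs, mu=mu_hat)  # divides by n
    return mu_hat, var_hat


def lognormal_log_likelihood(mu, var, sample):
    """Log-likelihood of log-normal parameters (mu, var)."""
    return sum(
        -math.log(x)
        - 0.5 * math.log(2.0 * math.pi * var)
        - (math.log(x) - mu) ** 2 / (2.0 * var)
        for x in sample
    )


# Hypothetical positive data, for illustration only.
sample = [1.2, 0.8, 2.5, 1.9, 3.1]
mu_hat, var_hat = lognormal_mle(sample)

# Invariance: the MLE of the standard deviation is the square root
# of the MLE of the variance.
sigma_hat = math.sqrt(var_hat)
print(mu_hat, var_hat, sigma_hat)
```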


34.2.4 Higher-order properties

The standard asymptotics tells that the maximum-likelihood
estimator is $\sqrt{n}$-consistent and asymptotically
efficient, meaning that it reaches the Cramér–Rao bound:

$$\sqrt{n}\, (\hat\theta_{\mathrm{mle}} - \theta_0) \xrightarrow{\ d\ } \mathcal{N}(0, I^{-1}),$$

where $I$ is the Fisher information matrix:

$$I_{jk} = \operatorname{E}_X \left[ -\frac{\partial^2 \ln f_{\theta_0}(X_t)}{\partial\theta_j \, \partial\theta_k} \right].$$

In particular, it means that the bias of the
maximum-likelihood estimator is equal to zero up to the order
$n^{-1/2}$. However when we consider the higher-order terms in
the expansion of the distribution of this estimator, it turns
out that $\hat\theta_{\mathrm{mle}}$ has bias of order $n^{-1}$. This
bias is equal to (componentwise) [5]

$$b_s \equiv \operatorname{E}[(\hat\theta_{\mathrm{mle}} - \theta_0)_s] = \frac{1}{n} \cdot I^{si} I^{jk} \left( \tfrac{1}{2} K_{ijk} + J_{j,ik} \right)$$

where Einstein’s summation convention over the repeating
indices has been adopted; $I^{jk}$ denotes the $j,k$-th
component of the inverse Fisher information matrix $I^{-1}$, and

$$\tfrac{1}{2} K_{ijk} + J_{j,ik} = \operatorname{E} \left[ \frac{1}{2} \frac{\partial^3 \ln f_{\theta_0}(x_t)}{\partial\theta_i \, \partial\theta_j \, \partial\theta_k} + \frac{\partial \ln f_{\theta_0}(x_t)}{\partial\theta_j} \, \frac{\partial^2 \ln f_{\theta_0}(x_t)}{\partial\theta_i \, \partial\theta_k} \right].$$

Using these formulas it is possible to estimate the
second-order bias of the maximum likelihood estimator, and
correct for that bias by subtracting it:

$$\hat\theta^{*}_{\mathrm{mle}} = \hat\theta_{\mathrm{mle}} - \hat b.$$

This estimator is unbiased up to the terms of order $n^{-1}$,
and is called the bias-corrected maximum likelihood
estimator.

This bias-corrected estimator is second-order efficient (at
least within the curved exponential family), meaning that
it has minimal mean squared error among all second-order
bias-corrected estimators, up to the terms of the
order $n^{-2}$. It is possible to continue this process, that is
to derive the third-order bias-correction term, and so on.
However as was shown by Kano (1996), the
maximum-likelihood estimator is not third-order efficient.


34.3 Examples

34.3.1 Discrete uniform distribution

Main article: German tank problem

Consider a case where n tickets numbered from 1 to n
are placed in a box and one is selected at random (see
uniform distribution); thus, the sample size is 1. If n is
unknown, then the maximum-likelihood estimator $\hat n$ of
n is the number m on the drawn ticket. (The likelihood
is 0 for n < m, 1/n for n ≥ m, and this is greatest when
n = m. Note that the maximum likelihood estimate of
n occurs at the lower extreme of possible values {m, m
+ 1, ...}, rather than somewhere in the “middle” of the
range of possible values, which would result in less bias.)
The expected value of the number m on the drawn ticket,
and therefore the expected value of $\hat n$, is (n + 1)/2. As a
result, with a sample size of 1, the maximum likelihood
estimator for n will systematically underestimate n by
(n - 1)/2.


34.3.2 Discrete distribution, finite parameter space

Suppose one wishes to determine just how biased an
unfair coin is. Call the probability of tossing a HEAD
p. The goal then becomes to determine p.

Suppose the coin is tossed 80 times: i.e., the sample
might be something like $x_1 = \mathrm{H}$, $x_2 = \mathrm{T}$, ...,
$x_{80} = \mathrm{T}$, and the count of the number of HEADS “H” is
observed.

The probability of tossing TAILS is 1 - p (so here p is
$\theta$ above). Suppose the outcome is 49 HEADS and 31
TAILS, and suppose the coin was taken from a box
containing three coins: one which gives HEADS with
probability p = 1/3, one which gives HEADS with probability
p = 1/2 and another which gives HEADS with probability
p = 2/3. The coins have lost their labels, so which
one it was is unknown. Using maximum likelihood
estimation the coin that has the largest likelihood can
be found, given the data that were observed. By using
the probability mass function of the binomial distribution
with sample size equal to 80, number successes equal to
49 but different values of p (the “probability of success”),
the likelihood function (defined below) takes one of three
values:

$$\Pr(\mathrm{H} = 49 \mid p = 1/3) = \binom{80}{49} (1/3)^{49} (1 - 1/3)^{31}$$

$$\Pr(\mathrm{H} = 49 \mid p = 1/2) = \binom{80}{49} (1/2)^{49} (1 - 1/2)^{31}$$

$$\Pr(\mathrm{H} = 49 \mid p = 2/3) = \binom{80}{49} (2/3)^{49} (1 - 2/3)^{31}$$

The likelihood is maximized when p = 2/3, and so this is
the maximum likelihood estimate for p.


34.3.3 Discrete distribution, continuous parameter space

Now suppose that there was only one coin but its p could
have been any value 0 ≤ p ≤ 1. The likelihood function
to be maximised is

$$L(p) = f_D(\mathrm{H} = 49 \mid p) = \binom{80}{49} p^{49} (1 - p)^{31},$$

and the maximisation is over all possible values 0 ≤ p ≤ 1.

[Figure: likelihood function for proportion value of a
binomial process (n = 10)]

One way to maximize this function is by differentiating
with respect to p and setting to zero:

$$\begin{aligned}
0 &= \frac{\partial}{\partial p} \left( \binom{80}{49} p^{49} (1 - p)^{31} \right) \\
  &\propto 49 p^{48} (1 - p)^{31} - 31 p^{49} (1 - p)^{30} \\
  &= p^{48} (1 - p)^{30} \left[ 49 (1 - p) - 31 p \right] \\
  &= p^{48} (1 - p)^{30} \left[ 49 - 80 p \right]
\end{aligned}$$

which has solutions p = 0, p = 1, and p = 49/80. The
solution which maximizes the likelihood is clearly p = 49/80
(since p = 0 and p = 1 result in a likelihood of zero). Thus
the maximum likelihood estimator for p is 49/80.

This result is easily generalized by substituting a letter
such as t in the place of 49 to represent the observed
number of ‘successes’ of our Bernoulli trials, and a
letter such as n in the place of 80 to represent the number
of Bernoulli trials. Exactly the same calculation yields the
maximum likelihood estimator t/n for any sequence of n
Bernoulli trials resulting in t ‘successes’.


34.3.4 Continuous distribution, continuous parameter space

For the normal distribution $\mathcal{N}(\mu, \sigma^2)$ which has
probability density function

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right),$$

the corresponding probability density function for a
sample of n independent identically distributed normal
random variables (the likelihood) is

$$f(x_1, \ldots, x_n \mid \mu, \sigma^2) = \prod_{i=1}^{n} f(x_i \mid \mu, \sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{2\sigma^2} \right),$$

or more conveniently:

$$f(x_1, \ldots, x_n \mid \mu, \sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{\sum_{i=1}^{n} (x_i - \bar x)^2 + n (\bar x - \mu)^2}{2\sigma^2} \right),$$

where $\bar x$ is the sample mean.

This family of distributions has two parameters:
$\theta = (\mu, \sigma)$, so we maximize the likelihood,
$\mathcal{L}(\mu, \sigma) = f(x_1, \ldots, x_n \mid \mu, \sigma)$, over both
parameters simultaneously, or if possible, individually.

Since the logarithm is a continuous strictly increasing
function over the range of the likelihood, the values which
maximize the likelihood will also maximize its logarithm.
This log likelihood can be written as follows:
$$\log(\mathcal{L}(\mu, \sigma)) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2$$

(Note: the log-likelihood is closely related to information
entropy and Fisher information.)

We now compute the derivatives of this log likelihood as
follows.

$$0 = \frac{\partial}{\partial\mu} \log(\mathcal{L}(\mu, \sigma)) = 0 - \frac{-2n (\bar x - \mu)}{2\sigma^2}.$$

This is solved by

$$\hat\mu = \bar x = \sum_{i=1}^{n} \frac{x_i}{n}.$$

This is indeed the maximum of the function since it is
the only turning point in $\mu$ and the second derivative is
strictly less than zero. Its expectation value is equal to the
parameter $\mu$ of the given distribution,

$$\operatorname{E}[\hat\mu] = \mu,$$

which means that the maximum-likelihood estimator
$\hat\mu$ is unbiased.

Similarly we differentiate the log likelihood with respect
to $\sigma$ and equate to zero:

$$\begin{aligned}
0 &= \frac{\partial}{\partial\sigma} \log\left( \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{\sum_{i=1}^{n} (x_i - \bar x)^2 + n (\bar x - \mu)^2}{2\sigma^2} \right) \right) \\
  &= \frac{\partial}{\partial\sigma} \left( \frac{n}{2} \log\left( \frac{1}{2\pi\sigma^2} \right) - \frac{\sum_{i=1}^{n} (x_i - \bar x)^2 + n (\bar x - \mu)^2}{2\sigma^2} \right) \\
  &= -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n} (x_i - \bar x)^2 + n (\bar x - \mu)^2}{\sigma^3}
\end{aligned}$$

which is solved by

$$\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2.$$

Inserting the estimate $\mu = \hat\mu$ we obtain

$$\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar x)^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j.$$

To calculate its expected value, it is convenient to rewrite
the expression in terms of zero-mean random variables
(statistical error) $\delta_i \equiv \mu - x_i$. Expressing the estimate
in these variables yields

$$\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (\mu - \delta_i)^2 - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\mu - \delta_i)(\mu - \delta_j).$$

Simplifying the expression above, utilizing the facts that
$\operatorname{E}[\delta_i] = 0$ and $\operatorname{E}[\delta_i^2] = \sigma^2$, allows us
to obtain

$$\operatorname{E}[\hat\sigma^2] = \frac{n - 1}{n} \sigma^2.$$

This means that the estimator $\hat\sigma$ is biased. However,
$\hat\sigma$ is consistent.

Formally we say that the maximum likelihood estimator
for $\theta = (\mu, \sigma^2)$ is:

$$\hat\theta = (\hat\mu, \hat\sigma^2).$$

In this case the MLEs could be obtained individually. In
general this may not be the case, and the MLEs would
have to be obtained simultaneously.

The normal log likelihood at its maximum takes a
particularly simple form:

$$\log(\mathcal{L}(\hat\mu, \hat\sigma)) = -\frac{n}{2} \left( \log(2\pi\hat\sigma^2) + 1 \right)$$

This maximum log likelihood can be shown to be the
same for more general least squares, even for
non-linear least squares. This is often used in determining
likelihood-based approximate confidence intervals
and confidence regions, which are generally more
accurate than those using the asymptotic normality discussed
above.


34.4 Non-independent variables

It may be the case that variables are correlated, that is,
not independent. Two random variables X and Y are
independent only if their joint probability density function
is the product of the individual probability density
functions, i.e.

$$f(x, y) = f(x) f(y)$$

Suppose one constructs an order-n Gaussian vector out
of random variables $(x_1, \ldots, x_n)$, where each variable
has means given by $(\mu_1, \ldots, \mu_n)$. Furthermore, let the
covariance matrix be denoted by $\Sigma$.

The joint probability density function of these n random
variables is then given by:

$$f(x_1, \ldots, x_n) = \frac{1}{(2\pi)^{n/2} \sqrt{\det(\Sigma)}} \exp\left( -\frac{1}{2} \left[ x_1 - \mu_1, \ldots, x_n - \mu_n \right] \Sigma^{-1} \left[ x_1 - \mu_1, \ldots, x_n - \mu_n \right]^{T} \right)$$

In the two variable case, the joint probability density
function is given by:

$$f(x, y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}} \exp\left[ -\frac{1}{2 (1 - \rho^2)} \left( \frac{(x - \mu_x)^2}{\sigma_x^2} - \frac{2\rho (x - \mu_x)(y - \mu_y)}{\sigma_x \sigma_y} + \frac{(y - \mu_y)^2}{\sigma_y^2} \right) \right]$$

In this and other cases where a joint density function
exists, the likelihood function is defined as above, in the
section Principles, using this density.


34.5 Iterative procedures

Consider problems where both states $x_i$ and parameters
such as $\sigma^2$ need to be estimated. Iterative procedures
such as Expectation-maximization algorithms may
be used to solve joint state-parameter estimation
problems.

For example, suppose that n samples of state estimates
$\hat x_i$ together with a sample mean $\bar x$ have been
calculated by either a minimum-variance Kalman filter or a
minimum-variance smoother using a previous variance
estimate $\hat\sigma^2$. Then the next variance iterate may be
obtained from the maximum likelihood estimate calculation

$$\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (\hat x_i - \bar x)^2.$$

The convergence of MLEs within filtering and smoothing
EM algorithms is studied in [6] [7] [8].


34.6 Applications

Maximum likelihood estimation is used for a wide range
of statistical models, including:

• linear models and generalized linear models;

• exploratory and confirmatory factor analysis;

• structural equation modeling;

• many situations in the context of hypothesis testing
and confidence interval formation;

• discrete choice models;

These uses arise across applications in a widespread set of
fields, including:

• communication systems;

• econometrics;

• time-delay of arrival (TDOA) in acoustic or
electromagnetic detection;

• data modeling in nuclear and particle physics;

• magnetic resonance imaging; [10]

• computational phylogenetics;

• origin/destination and path-choice modeling in
transport networks;

• geographical satellite-image classification;

• power system state estimation.


34.7 History

Maximum-likelihood estimation was recommended,
analyzed (with flawed attempts at proofs) and vastly
popularized by R. A. Fisher between 1912 and 1922 [11] (although
it had been used earlier by Gauss, Laplace, T. N. Thiele,
and F. Y. Edgeworth). [12] Reviews of the development of
maximum likelihood have been provided by a number of
authors. [13]

Much of the theory of maximum-likelihood estimation
was first developed for Bayesian statistics, and then
simplified by later authors. [11]


34.8 See also 

• Other estimation methods 

• Generalized method of moments are methods 
related to the likelihood equation in maximum 
likelihood estimation. 

• M-estimator, an approach used in robust 
statistics. 

• Maximum a posteriori (MAP) estimator, for 
a contrast in the way to calculate estimators 
when prior knowledge is postulated. 

• Maximum spacing estimation, a related 
method that is more robust in many situations. 

• Method of moments (statistics), another pop¬ 
ular method for finding parameters of distri¬ 
butions. 

• Method of support, a variation of the maxi¬ 
mum likelihood technique. 

• Minimum distance estimation 
• Quasi-maximum likelihood estimator, an
MLE estimator that is misspecified, but still
consistent.

• Restricted maximum likelihood, a variation 
using a likelihood function calculated from a 
transformed set of data. 

• Related concepts: 

• The BHHH algorithm is a non-linear opti¬ 
mization algorithm that is popular for Maxi¬ 
mum Likelihood estimations. 

• Extremum estimator, a more general class of 
estimators to which MLE belongs. 

• Fisher information, information matrix, its re¬ 
lationship to covariance matrix of ML esti¬ 
mates 

• Likelihood function, a description on what 
likelihood functions are. 

• Mean squared error, a measure of how 'good' 
an estimator of a distributional parameter is 
(be it the maximum likelihood estimator or 
some other estimator). 

• The Rao-Blackwell theorem, a result which
yields a process for finding the best possible
unbiased estimator (in the sense of having
minimal mean squared error). The MLE is often
a good starting place for the process.

• Sufficient statistic, a function of the data
through which the MLE (if it exists and is
unique) will depend on the data.
Chapter 35

Linear classifier 


In the field of machine learning, the goal of statistical
classification is to use an object’s characteristics to identify
which class (or group) it belongs to. A linear classifier
achieves this by making a classification decision
based on the value of a linear combination of the
characteristics. An object’s characteristics are also known as
feature values and are typically presented to the machine
in a vector called a feature vector. Such classifiers work
well for practical problems such as document classification,
and more generally for problems with many variables
(features), reaching accuracy levels comparable to
non-linear classifiers while taking less time to train and use.


35.1 Definition

Figure: In this case, the solid and empty dots can be correctly classified
by any number of linear classifiers. H1 (blue) classifies them
correctly, as does H2 (red). H2 could be considered “better” in
the sense that it is also furthest from both groups. H3 (green)
fails to correctly classify the dots.

If the input feature vector to the classifier is a real vector
x, then the output score is

\[ y = f(\mathbf{w} \cdot \mathbf{x}) = f\Bigl(\sum_j w_j x_j\Bigr), \]

where w is a real vector of weights and f is a function
that converts the dot product of the two vectors into the
desired output. (In other words, w is a one-form or linear
functional mapping x onto R.) The weight vector w is
learned from a set of labeled training samples. Often f
is a simple function that maps all values above a certain
threshold to the first class and all other values to the second
class. A more complex f might give the probability
that an item belongs to a certain class.

For a two-class classification problem, one can visualize
the operation of a linear classifier as splitting a
high-dimensional input space with a hyperplane: all points on
one side of the hyperplane are classified as “yes”, while
the others are classified as “no”.

A linear classifier is often used in situations where the
speed of classification is an issue, since it is often the
fastest classifier, especially when x is sparse. Also, linear
classifiers often work very well when the number of
dimensions in x is large, as in document classification,
where each element in x is typically the number of
occurrences of a word in a document (see document-term
matrix). In such cases, the classifier should be
well-regularized.
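The scoring rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the three-word vocabulary, the weights and the word counts are all hypothetical, and f is taken to be a plain threshold at zero.

```python
from typing import Sequence


def linear_score(w: Sequence[float], x: Sequence[float]) -> float:
    """Dot product w . x of the weight vector and the feature vector."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    return sum(wj * xj for wj, xj in zip(w, x))


def threshold_classify(w: Sequence[float], x: Sequence[float],
                       threshold: float = 0.0) -> int:
    """f maps scores above the threshold to class 1, all others to class 0."""
    return 1 if linear_score(w, x) > threshold else 0


# Hypothetical weights for a 3-word vocabulary and two word-count vectors.
weights = [0.8, -1.2, 0.3]
doc_a = [3, 0, 1]  # score = 0.8*3 + 0.3*1 = 2.7
doc_b = [0, 2, 1]  # score = -1.2*2 + 0.3*1 = -2.1

print(threshold_classify(weights, doc_a), threshold_classify(weights, doc_b))
```

Because the score is a single dot product, zero entries of a sparse x contribute nothing, which is one reason classification is fast in the sparse case.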


35.2 Generative models vs. discriminative models

There are two broad classes of methods for determining
the parameters of a linear classifier w.[2][3] Methods
of the first class model conditional density functions
P(x|class). Examples of such algorithms include:

• Linear Discriminant Analysis (or Fisher’s linear
discriminant) (LDA), which assumes Gaussian conditional
density models

• Naive Bayes classifier with multinomial or multivariate
Bernoulli event models.
The second set of methods includes discriminative models,
which attempt to maximize the quality of the output
on a training set. Additional terms in the training
cost function can easily perform regularization of the final
model. Examples of discriminative training of linear
classifiers include:

• Logistic regression: maximum likelihood estimation
of w assuming that the observed training set was
generated by a binomial model that depends on the
output of the classifier.

• Perceptron: an algorithm that attempts to fix all errors
encountered in the training set.

• Support vector machine: an algorithm that maximizes
the margin between the decision hyperplane
and the examples in the training set.

Note: Despite its name, LDA does not belong to the
class of discriminative models in this taxonomy. However,
its name makes sense when we compare LDA
to the other main linear dimensionality reduction algorithm:
principal components analysis (PCA). LDA is a
supervised learning algorithm that utilizes the labels of
the data, while PCA is an unsupervised learning algorithm
that ignores the labels. To summarize, the name
is a historical artifact.[4]:117

Discriminative training often yields higher accuracy than 
modeling the conditional density functions. However, 
handling missing data is often easier with conditional 
density models. 

All of the linear classifier algorithms listed above can be
converted into non-linear algorithms operating on a different
input space φ(x), using the kernel trick.

35.2.1 Discriminative training 

Discriminative training of linear classifiers usually proceeds
in a supervised way, by means of an optimization
algorithm that is given a training set with desired outputs
and a loss function that measures the discrepancy between
the classifier’s outputs and the desired outputs. Thus, the
learning algorithm solves an optimization problem of the
form[1]

\[ \underset{\mathbf{w}}{\arg\min} \; R(\mathbf{w}) + C \sum_{i=1}^{N} L(y_i, \mathbf{w}^T \mathbf{x}_i) \]

where 

• w are the classifier’s parameters,

• $L(y_i, \mathbf{w}^T \mathbf{x}_i)$ is the loss of the prediction given the
desired output $y_i$ for the i-th training example,

• R(w) is a regularization term that prevents the parameters
from getting too large (causing overfitting),
and

• C is some constant (set by the user of the learning
algorithm) that weighs the regularization against the
loss.

Popular loss functions include the hinge loss (for linear
SVMs) and the log loss (for linear logistic regression). If
the regularization function R is convex, then the above
is a convex problem.[1] Many algorithms exist for solving
such problems; popular ones for linear classification include
(stochastic) gradient descent, L-BFGS, coordinate
descent and Newton methods.
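As a rough sketch of this optimization problem, the code below minimizes R(w) + C Σ L(y_i, w^T x_i) by plain (batch) gradient descent. Several choices here are assumptions made for illustration and are not taken from the text: labels are coded as -1/+1, L is the log loss log(1 + exp(-y w^T x)), R(w) is half the squared Euclidean norm, and the data, step size and iteration count are made up.

```python
import math


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def objective(w, X, y, C):
    """R(w) + C * sum_i L(y_i, w.x_i) with R = 0.5*||w||^2 and log loss."""
    reg = 0.5 * sum(wj * wj for wj in w)
    loss = sum(math.log1p(math.exp(-yi * dot(w, xi))) for xi, yi in zip(X, y))
    return reg + C * loss


def fit_gradient_descent(X, y, C=1.0, learning_rate=0.1, steps=200):
    """Batch gradient descent on the regularized log-loss objective."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = list(w)  # gradient of 0.5*||w||^2 is w itself
        for xi, yi in zip(X, y):
            # d/dw log(1 + exp(-y w.x)) = -y * x / (1 + exp(y w.x))
            scale = -yi / (1.0 + math.exp(yi * dot(w, xi)))
            for j, xij in enumerate(xi):
                grad[j] += C * scale * xij
        w = [wj - learning_rate * gj for wj, gj in zip(w, grad)]
    return w


# Made-up two-feature training set with labels in {-1, +1}.
X_train = [[1.0, 2.0], [2.0, 1.5], [-1.0, -1.5], [-2.0, -1.0]]
y_train = [1, 1, -1, -1]

w_fit = fit_gradient_descent(X_train, y_train)
print(w_fit, objective(w_fit, X_train, y_train, C=1.0))
```

Swapping the log loss for the hinge loss would turn the same skeleton into a (sub)gradient method for a linear SVM; in practice one of the solvers named above would normally be used instead of this fixed-step loop.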


35.3 See also 

• Linear regression 

• Winnow (algorithm) 

• Quadratic classifier 

• Support vector machines 
Chapter 36

Logistic regression 


In statistics, logistic regression, or logit regression, or
logit model[1] is a direct probability model that was developed
by statistician D. R. Cox in 1958[2][3] although
much work was done in the single independent variable
case almost two decades earlier. The binary logistic
model is used to predict a binary response based on
one or more predictor variables (features). That is, it
is used in estimating the parameters of a qualitative response
model. The probabilities describing the possible
outcomes of a single trial are modeled, as a function
of the explanatory (predictor) variables, using a logistic
function. Frequently (and hereafter in this article) “logistic
regression” is used to refer specifically to the problem
in which the dependent variable is binary (that is,
the number of available categories is two), while problems
with more than two categories are referred to as
multinomial logistic regression, or, if the multiple categories
are ordered, as ordinal logistic regression.[3]

Logistic regression measures the relationship between the
categorical dependent variable and one or more independent
variables, which are usually (but not necessarily)
continuous, by estimating probabilities. Thus, it treats the
same set of problems as does probit regression using similar
techniques; the first assumes a logistic function and
the second a standard normal distribution function.

Logistic regression can be seen as a special case of the
generalized linear model and thus analogous to linear
regression. The model of logistic regression, however, is
based on quite different assumptions (about the relationship
between dependent and independent variables) from
those of linear regression. In particular, the key differences
between these two models can be seen in the following
two features of logistic regression. First, the conditional
distribution p(y | x) is a Bernoulli distribution rather
than a Gaussian distribution, because the dependent variable
is binary. Second, the estimated probabilities are
restricted to [0,1] through the logistic distribution function
because logistic regression predicts the probability
of the instance being positive.

Logistic regression is an alternative to Fisher’s 1936
classification method, linear discriminant analysis.[4] If the
assumptions of linear discriminant analysis hold, application
of Bayes’ rule to reverse the conditioning results
in the logistic model, so if linear discriminant assumptions
are true, logistic regression assumptions must hold.
The converse is not true, so the logistic model has fewer
assumptions than discriminant analysis and makes no
assumption on the distribution of the independent variables.


36.1 Fields and example applications

Logistic regression is used widely in many fields, including
the medical and social sciences. For example,
the Trauma and Injury Severity Score (TRISS), which
is widely used to predict mortality in injured patients,
was originally developed by Boyd et al. using logistic
regression.[3] Many other medical scales used to assess
the severity of a patient's condition have been developed using
logistic regression.[6][7][8][9] Logistic regression may be used
to predict whether a patient has a given disease (e.g.
diabetes; coronary heart disease), based on observed
characteristics of the patient (age, sex, body mass index,
results of various blood tests, etc.; age, blood cholesterol
level, systolic blood pressure, relative weight, blood
hemoglobin level, smoking (at 3 levels), and abnormal
electrocardiogram).[1][10] Another example might be to
predict whether an American voter will vote Democratic
or Republican, based on age, income, sex, race, state of
residence, votes in previous elections, etc.[11] The technique
can also be used in engineering, especially for predicting
the probability of failure of a given process, system
or product.[12][13] It is also used in marketing applications
such as prediction of a customer’s propensity to purchase
a product or halt a subscription, etc. In economics it
can be used to predict the likelihood of a person’s choosing
to be in the labor force, and a business application
would be to predict the likelihood of a homeowner defaulting
on a mortgage. Conditional random fields, an extension
of logistic regression to sequential data, are used
in natural language processing.
36.2 Basics


Logistic regression can be binomial or multinomial.
Binomial or binary logistic regression deals with situations
in which the observed outcome for a dependent variable
can have only two possible types (for example, “dead”
vs. “alive” or “win” vs. “loss”). Multinomial logistic
regression deals with situations where the outcome can have
three or more possible types (e.g., “disease A” vs.
“disease B” vs. “disease C”). In binary logistic regression,
the outcome is usually coded as “0” or “1”, as this leads
to the most straightforward interpretation.[14] If a particular
observed outcome for the dependent variable is the
noteworthy possible outcome (referred to as a “success”
or a “case”) it is usually coded as “1” and the contrary
outcome (referred to as a “failure” or a “noncase”) as “0”.
Logistic regression is used to predict the odds of being
a case based on the values of the independent variables
(predictors). The odds are defined as the probability that
a particular outcome is a case divided by the probability
that it is a noncase.

Like other forms of regression analysis, logistic regression
makes use of one or more predictor variables that
may be either continuous or categorical data. Unlike ordinary
linear regression, however, logistic regression is used
for predicting binary outcomes of the dependent variable
(treating the dependent variable as the outcome of a
Bernoulli trial) rather than a continuous outcome. Given
this difference, it is necessary that logistic regression take
the natural logarithm of the odds of the dependent variable
being a case (referred to as the logit or log-odds)
to create a continuous criterion as a transformed version
of the dependent variable. Thus the logit transformation
is referred to as the link function in logistic regression:
although the dependent variable in logistic regression is
binomial, the logit is the continuous criterion upon which
linear regression is conducted.[14]

The logit of success is then fitted to the predictors using
linear regression analysis. The predicted value of the logit
is converted back into predicted odds via the inverse of
the natural logarithm, namely the exponential function.
Thus, although the observed dependent variable in logistic
regression is a zero-or-one variable, the logistic regression
estimates the odds, as a continuous variable, that the
dependent variable is a success (a case). In some applications
the odds are all that is needed. In others, a specific
yes-or-no prediction is needed for whether the dependent
variable is or is not a case; this categorical prediction can
be based on the computed odds of a success, with predicted
odds above some chosen cutoff value being translated
into a prediction of a success.
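The chain from predicted logit to odds to a yes-or-no prediction can be illustrated as follows. This is a minimal sketch: the function names are hypothetical, and the default cutoff of 1.0 on the odds (equivalent to a probability of 0.5) is an assumption chosen for the example, not a value prescribed by the text.

```python
import math


def odds_from_logit(logit: float) -> float:
    """Invert the natural logarithm: odds = exp(logit)."""
    return math.exp(logit)


def probability_from_odds(odds: float) -> float:
    """odds = p / (1 - p), hence p = odds / (1 + odds)."""
    return odds / (1.0 + odds)


def predict_case(logit: float, odds_cutoff: float = 1.0) -> int:
    """Predict 1 (a case) when the predicted odds exceed the chosen cutoff."""
    return 1 if odds_from_logit(logit) > odds_cutoff else 0


predicted_logit = math.log(3.0)  # hypothetical fitted logit
print(odds_from_logit(predicted_logit))   # odds of about 3, i.e. 3 to 1
print(probability_from_odds(3.0))         # probability 0.75
print(predict_case(predicted_logit))      # 1: odds exceed the cutoff
```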



Figure 1. The logistic function σ(t); note that σ(t) ∈ [0,1] for
all t.

36.3 Logistic function, odds, odds ratio, and logit

36.3.1 Definition of the logistic function 

An explanation of logistic regression begins with an
explanation of the logistic function. The logistic function is
useful because it can take an input with any value from
negative to positive infinity, whereas the output always
takes values between zero and one[14] and hence is
interpretable as a probability. The logistic function σ(t) is
defined as follows:

\[ \sigma(t) = \frac{1}{1 + e^{-t}} \]


A graph of the logistic function is shown in Figure 1. 

If t is viewed as a linear function of an explanatory variable
x (or of a linear combination of explanatory variables),
then we express t as follows:

\[ t = \beta_0 + \beta_1 x \]

And the logistic function can now be written as:

\[ F(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \]


Note that F(x) is interpreted as the probability of the
dependent variable equaling a “success” or “case” rather
than a failure or non-case. The response variables $Y_i$
are not identically distributed: $P(Y_i = 1 \mid X)$
differs from one data point $X_i$ to another, though they
are independent given the design matrix X and the shared
parameters β.[1]
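A small sketch of the logistic function and of F(x) follows. The branch on the sign of t is an implementation choice made here to avoid floating-point overflow for large negative inputs; the coefficient values are hypothetical.

```python
import math


def logistic(t: float) -> float:
    """sigma(t) = 1 / (1 + exp(-t)), written to avoid overflow in exp."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)  # safe: t < 0, so exp(t) < 1
    return z / (1.0 + z)


def F(x: float, beta0: float, beta1: float) -> float:
    """Probability of a case for predictor value x: sigma(beta0 + beta1*x)."""
    return logistic(beta0 + beta1 * x)


# Hypothetical coefficients: the probability crosses 0.5 where beta0 + beta1*x = 0.
beta0, beta1 = -3.0, 1.5
for x in (0.0, 2.0, 4.0):
    print(x, F(x, beta0, beta1))
```

With these coefficients, F(2.0) is exactly 0.5 because the linear expression is zero there, and the values for different x differ, which is the sense in which the responses are not identically distributed.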
36.3.2 Definition of the inverse of the logistic function

We can now define the inverse of the logistic function, g,
the logit (log odds):

\[ g(F(x)) = \ln\frac{F(x)}{1 - F(x)} = \beta_0 + \beta_1 x, \]

and equivalently:

\[ \frac{F(x)}{1 - F(x)} = e^{\beta_0 + \beta_1 x}. \]

36.3.3 Interpretation of these terms 

In the above equations, the terms are as follows: 

• g(·) refers to the logit function. The equation for
g(F(x)) illustrates that the logit (i.e., log-odds or
natural logarithm of the odds) is equivalent to the
linear regression expression.

• ln denotes the natural logarithm.

• F(x) is the probability that the dependent variable
equals a case, given some linear combination x of
the predictors. The formula for F(x) illustrates that
the probability of the dependent variable equaling a
case is equal to the value of the logistic function of
the linear regression expression. This is important
in that it shows that the value of the linear regression
expression can vary from negative to positive
infinity and yet, after transformation, the resulting
expression for the probability F(x) ranges between
0 and 1.

• $\beta_0$ is the intercept from the linear regression equation
(the value of the criterion when the predictor is
equal to zero).

• $\beta_1 x$ is the regression coefficient multiplied by some
value of the predictor.

• e denotes the base of the exponential function.

36.3.4 Definition of the odds 

The odds of the dependent variable equaling a case (given
some linear combination x of the predictors) is equivalent
to the exponential function of the linear regression
expression. This illustrates how the logit serves as a link
function between the probability and the linear regression
expression. Given that the logit ranges between negative
and positive infinity, it provides an adequate criterion
upon which to conduct linear regression and the logit
is easily converted back into the odds.[14]

So we define odds of the dependent variable equaling a
case (given some linear combination x of the predictors)
as follows:

\[ \text{odds} = e^{\beta_0 + \beta_1 x}. \]

36.3.5 Definition of the odds ratio 

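The odds ratio can be defined as:

\[ \mathrm{OR} = \frac{\operatorname{odds}(x+1)}{\operatorname{odds}(x)} = \frac{F(x+1)/(1 - F(x+1))}{F(x)/(1 - F(x))} = \frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1} \]

or, for a binary variable, with F(0) in place of F(x) and F(1) in place of
F(x+1). This exponential relationship provides an interpretation
for $\beta_1$: the odds multiply by $e^{\beta_1}$ for every
1-unit increase in x.[15]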
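The identities in this section can be checked numerically. The sketch below uses hypothetical coefficient values and helper names; it verifies that the logit undoes the logistic function and that the odds ratio for a one-unit increase does not depend on x.

```python
import math


def logistic(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def logit(p: float) -> float:
    """Log-odds: the inverse of the logistic function, for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def odds(x: float, beta0: float, beta1: float) -> float:
    """odds = exp(beta0 + beta1 * x)."""
    return math.exp(beta0 + beta1 * x)


def odds_ratio(x: float, beta0: float, beta1: float) -> float:
    """Ratio of the odds at x + 1 to the odds at x."""
    return odds(x + 1.0, beta0, beta1) / odds(x, beta0, beta1)


beta0, beta1 = -1.0, 0.7  # hypothetical coefficients
for x in (-2.0, 0.0, 3.5):
    t = beta0 + beta1 * x
    # logit(F(x)) recovers the linear expression; the odds ratio is exp(beta1).
    print(x, logit(logistic(t)), t, odds_ratio(x, beta0, beta1), math.exp(beta1))
```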

36.3.6 Multiple explanatory variables 

If there are multiple explanatory variables, the above
expression $\beta_0 + \beta_1 x$ can be revised to
$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m$.
Then when this is used in the equation relating
the logged odds of a success to the values of the
predictors, the linear regression will be a multiple regression
with m explanators; the parameters $\beta_j$ for all
j = 0, 1, 2, ..., m are all estimated.

36.4 Model fitting 

36.4.1 Estimation 

Because the model can be expressed as a generalized linear
model (see below), for 0<p<1, ordinary least squares
can suffice, with R-squared as the measure of goodness
of fit in the fitting space. When p=0 or 1, more complex
methods are required.

Maximum likelihood estimation 

The regression coefficients are usually estimated using
maximum likelihood estimation.[16] Unlike linear regression
with normally distributed residuals, it is not possible
to find a closed-form expression for the coefficient values
that maximize the likelihood function, so an iterative
process must be used instead, for example Newton’s
method. This process begins with a tentative solution,
revises it slightly to see if it can be improved, and repeats
this revision until improvement is minute, at which point
the process is said to have converged.[17]
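The iterative process can be sketched for the simplest case of an intercept and one predictor. The code below is an illustrative Newton iteration on the log-likelihood, not a production routine: the starting point (0, 0), the tolerance, the iteration limit and the small data set are all assumptions made for the example.

```python
import math


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic_newton(xs, ys, tol=1e-10, max_iter=25):
    """Maximum likelihood fit of P(y=1|x) = sigmoid(b0 + b1*x) by Newton's method."""
    b0, b1 = 0.0, 0.0  # tentative starting solution
    for iteration in range(1, max_iter + 1):
        g0 = g1 = 0.0            # gradient of the log-likelihood (the score)
        h00 = h01 = h11 = 0.0    # information matrix (negative Hessian)
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            g0 += y - p
            g1 += (y - p) * x
            w = p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det <= 0.0:
            raise ArithmeticError("information matrix is singular")
        # Newton step: solve the 2x2 system (information matrix) * delta = score.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:  # improvement is minute: converged
            return b0, b1, iteration
    raise RuntimeError("no convergence within max_iter iterations")


# Made-up data: hours studied and whether the exam was passed (classes overlap).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0_hat, b1_hat, n_iter = fit_logistic_newton(hours, passed)
print(b0_hat, b1_hat, n_iter)
```

At convergence the score equations are satisfied, that is, the sums of (y - p) and of (y - p)x over the data are numerically zero. With completely separated data the same loop would not settle, which matches the failure modes discussed next.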

In some instances the model may not reach convergence.
Nonconvergence of a model indicates that the coefficients
are not meaningful because the iterative process was unable
to find appropriate solutions. A failure to converge
may occur for a number of reasons: having a large ratio
of predictors to cases, multicollinearity, sparseness,
or complete separation.

• Having a large ratio of variables to cases results in an
overly conservative Wald statistic (discussed below)
and can lead to nonconvergence.

• Multicollinearity refers to unacceptably high correlations
between predictors. As multicollinearity increases,
coefficients remain unbiased but standard
errors increase and the likelihood of model convergence
decreases.[16] To detect multicollinearity
amongst the predictors, one can conduct a linear
regression analysis with the predictors of interest for
the sole purpose of examining the tolerance statistic,[16]
which is used to assess whether multicollinearity is
unacceptably high.

• Sparseness in the data refers to having a large
proportion of empty cells (cells with zero counts). Zero
cell counts are particularly problematic with categorical
predictors. With continuous predictors, the
model can infer values for the zero cell counts, but
this is not the case with categorical predictors. The
model will not converge with zero cell counts for
categorical predictors because the natural logarithm of
zero is an undefined value, so that final solutions to
the model cannot be reached. To remedy this problem,
researchers may collapse categories in a theoretically
meaningful way or add a constant to all
cells.[16]

• Another numerical problem that may lead to a lack
of convergence is complete separation, which refers
to the instance in which the predictors perfectly predict
the criterion, so that all cases are accurately classified.
In such instances, one should reexamine the data, as
there is likely some kind of error.[14]

As a general rule of thumb, logistic regression models
require a minimum of about 10 events per explanatory
variable (where event denotes the cases belonging to the less
frequent category in the dependent variable).[18]

Minimum chi-squared estimator for grouped data 

While individual data will have a dependent variable with
a value of zero or one for every observation, with grouped
data one observation is on a group of people who all share
the same characteristics (e.g., demographic characteristics);
in this case the researcher observes the proportion
of people in the group for whom the response variable
falls into one category or the other. If this proportion
is neither zero nor one for any group, the minimum
chi-squared estimator involves using weighted least squares
to estimate a linear model in which the dependent variable
is the logit of the proportion: that is, the log of the
ratio of the fraction in one group to the fraction in the
other group.[19]:pp. 686–9
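A sketch of this estimator for one predictor is given below. The text does not state the weights; the code assumes the commonly used choice n·p·(1 − p) (the approximate inverse variance of an empirical logit), and the function name and the grouped data are hypothetical.

```python
import math


def min_chi2_logit_fit(xs, group_sizes, case_counts):
    """Weighted least squares of logit(proportion) on x for grouped data.

    Assumed weights: n * p * (1 - p) for a group of size n with case
    proportion p. Every proportion must lie strictly between 0 and 1.
    """
    rows = []
    for x, n, k in zip(xs, group_sizes, case_counts):
        if not 0 < k < n:
            raise ValueError("each group proportion must be strictly between 0 and 1")
        p = k / n
        rows.append((x, math.log(p / (1.0 - p)), n * p * (1.0 - p)))
    total_w = sum(w for _, _, w in rows)
    x_bar = sum(w * x for x, _, w in rows) / total_w
    z_bar = sum(w * z for _, z, w in rows) / total_w
    sxx = sum(w * (x - x_bar) ** 2 for x, _, w in rows)
    sxz = sum(w * (x - x_bar) * (z - z_bar) for x, z, w in rows)
    slope = sxz / sxx
    intercept = z_bar - slope * x_bar
    return intercept, slope


# Three hypothetical groups of 10 people with 2, 5 and 8 cases.
# Their logits are -ln 4, 0 and ln 4, exactly linear in x.
intercept, slope = min_chi2_logit_fit([0, 1, 2], [10, 10, 10], [2, 5, 8])
print(intercept, slope)
```

Because the three empirical logits lie exactly on a line here, the fit recovers it regardless of the weights; with real data the weights determine how much each group influences the estimate.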

36.4.2 Evaluating goodness of fit 

Goodness of fit in linear regression models is generally
measured using the R². Since this has no direct analog in
logistic regression, various methods[19]:ch. 21 including the
following can be used instead.

Deviance and likelihood ratio tests 

In linear regression analysis, one is concerned with partitioning
variance via the sum of squares calculations:
variance in the criterion is essentially divided into variance
accounted for by the predictors and residual variance.
In logistic regression analysis, deviance is used
in lieu of sum of squares calculations.[20] Deviance is
analogous to the sum of squares calculations in linear
regression[14] and is a measure of the lack of fit to the
data in a logistic regression model.[20] When a “saturated”
model is available (a model with a theoretically perfect
fit), deviance is calculated by comparing a given model
with the saturated model.[14] This computation gives the
likelihood-ratio test:[14]

\[ D = -2 \ln \frac{\text{likelihood of the fitted model}}{\text{likelihood of the saturated model}}. \]

In the above equation D represents the deviance and ln
represents the natural logarithm. The log of the likelihood
ratio (the ratio of the fitted model to the saturated
model) will produce a negative value, so it is
multiplied by negative two to produce a positive value
with an approximate chi-squared distribution.[14]
Smaller values indicate better fit as the fitted
model deviates less from the saturated model. When
assessed upon a chi-square distribution, nonsignificant
chi-square values indicate very little unexplained variance
and thus, good model fit. Conversely, a significant
chi-square value indicates that a significant amount of the
variance is unexplained.

When the saturated model is not available (a common
case), deviance is calculated simply as −2 × (log-likelihood
of the fitted model), and the reference to the saturated
model’s log-likelihood can be removed from all that
follows without harm.

Two measures of deviance are particularly important in
logistic regression: null deviance and model deviance.
The null deviance represents the difference between a
model with only the intercept (which means “no predictors”)
and the saturated model. The model deviance represents
the difference between a model with at least one
predictor and the saturated model.[20] In this respect, the
null model provides a baseline upon which to compare
predictor models. Given that deviance is a measure of
the difference between a given model and the saturated
model, smaller values indicate better fit. Thus, to assess
the contribution of a predictor or set of predictors, one
can subtract the model deviance from the null deviance
and assess the difference on a chi-square distribution
with degrees of freedom[14] equal to the difference
in the number of parameters estimated.
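The comparison of null and model deviance can be sketched with a made-up data set that has a single binary predictor. In that special case the maximum likelihood fitted probabilities are simply the observed case proportions in each predictor group, so no iterative fit is needed. Deviance is computed as −2 × log-likelihood (the saturated-model term is dropped), and the chi-square tail probability uses a closed form that is valid only for one degree of freedom. All names and counts are hypothetical.

```python
import math


def log_likelihood(ys, ps):
    """Bernoulli log-likelihood of outcomes ys under probabilities ps."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for y, p in zip(ys, ps))


def deviance(ys, ps):
    """Deviance as -2 * log-likelihood (saturated-model term omitted)."""
    return -2.0 * log_likelihood(ys, ps)


def chi2_sf_df1(stat):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical data: 20 observations with x = 0 (5 cases), 20 with x = 1 (15 cases).
xs = [0] * 20 + [1] * 20
ys = [1] * 5 + [0] * 15 + [1] * 15 + [0] * 5

# Null model (intercept only): every observation gets the overall case rate.
base_rate = sum(ys) / len(ys)
null_ps = [base_rate] * len(ys)

# Model with the binary predictor: fitted probability = case rate in each group.
group_rate = {0: 5 / 20, 1: 15 / 20}
fitted_ps = [group_rate[x] for x in xs]

d_null = deviance(ys, null_ps)
d_fitted = deviance(ys, fitted_ps)
lr_statistic = d_null - d_fitted      # one extra parameter, so 1 degree of freedom
p_value = chi2_sf_df1(lr_statistic)
print(d_null, d_fitted, lr_statistic, p_value)
```

Here the difference in deviance is a little over 10 on one degree of freedom, so the predictor would be judged to improve the fit at conventional significance levels.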

Let

\[ D_{\text{null}} = -2 \ln \frac{\text{likelihood of null model}}{\text{likelihood of the saturated model}} \]

\[ D_{\text{fitted}} = -2 \ln \frac{\text{likelihood of fitted model}}{\text{likelihood of the saturated model}} \]

Then

\[ \begin{aligned}
D_{\text{null}} - D_{\text{fitted}}
&= \left( -2 \ln \frac{\text{likelihood of null model}}{\text{likelihood of the saturated model}} \right) - \left( -2 \ln \frac{\text{likelihood of fitted model}}{\text{likelihood of the saturated model}} \right) \\
&= -2 \left( \ln \frac{\text{likelihood of null model}}{\text{likelihood of the saturated model}} - \ln \frac{\text{likelihood of fitted model}}{\text{likelihood of the saturated model}} \right) \\
&= -2 \ln \frac{\left( \dfrac{\text{likelihood of null model}}{\text{likelihood of the saturated model}} \right)}{\left( \dfrac{\text{likelihood of fitted model}}{\text{likelihood of the saturated model}} \right)} \\
&= -2 \ln \frac{\text{likelihood of the null model}}{\text{likelihood of fitted model}}.
\end{aligned} \]

If the model deviance is significantly smaller than the null
deviance then one can conclude that the predictor or set of
predictors significantly improved model fit. This is analogous
to the F-test used in linear regression analysis to
assess the significance of prediction.[20]

Pseudo-R²s

In linear regression the squared multiple correlation, R²,
is used to assess goodness of fit as it represents the
proportion of variance in the criterion that is explained by
the predictors.[20] In logistic regression analysis, there is
no agreed upon analogous measure, but there are several
competing measures each with limitations.[20] Three of
the most commonly used indices are examined on this
page beginning with the likelihood ratio R², R²_L:[20]

\[ R^2_L = \frac{D_{\text{null}} - D_{\text{fitted}}}{D_{\text{null}}} \]

This is the most analogous index to the squared multiple
correlation in linear regression.[16] It represents the
proportional reduction in the deviance wherein the deviance
is treated as a measure of variation analogous but not
identical to the variance in linear regression analysis.[16]
One limitation of the likelihood ratio R² is that it is not
monotonically related to the odds ratio,[20] meaning that it
does not necessarily increase as the odds ratio increases
and does not necessarily decrease as the odds ratio
decreases.

The Cox and Snell R² is an alternative index of goodness
of fit related to the R² value from linear regression.
The Cox and Snell index is problematic as its maximum
value is .75, when the variance is at its maximum (.25).
The Nagelkerke R² provides a correction to the Cox and
Snell R² so that the maximum value is equal to one.
Nevertheless, the Cox and Snell and likelihood ratio R²s show
greater agreement with each other than either does with
the Nagelkerke R².[20] Of course, this might not be the
case for values exceeding .75 as the Cox and Snell index
is capped at this value. The likelihood ratio R² is often
preferred to the alternatives as it is most analogous to R²
in linear regression, is independent of the base rate (both
Cox and Snell and Nagelkerke R²s increase as the proportion
of cases increases from 0 to .5) and varies between 0
and 1.

A word of caution is in order when interpreting pseudo-R²
statistics. The reason these indices of fit are referred
to as pseudo R² is that they do not represent the proportionate
reduction in error as the R² in linear regression
does.[20] Linear regression assumes homoscedasticity,
that the error variance is the same for all values of the
criterion. Logistic regression will always be heteroscedastic:
the error variances differ for each value of the predicted
score. For each value of the predicted score there would
be a different value of the proportionate reduction in
error. Therefore, it is inappropriate to think of R² as a
proportionate reduction in error in a universal sense in
logistic regression.[20]

Hosmer-Lemeshow test

The Hosmer-Lemeshow test uses a test statistic that
asymptotically follows a χ² distribution to assess whether
or not the observed event rates match expected event rates
in subgroups of the model population.

Evaluating binary classification performance


If the estimated probabilities are to be used to classify
each observation of independent variable values as predicting
the category that the dependent variable is found
in, the various methods below for judging the model’s
suitability in out-of-sample forecasting can also be used
on the data that were used for estimation: accuracy,
precision (also called positive predictive value), recall
(also called sensitivity), specificity and negative predictive
value. In each of these evaluative methods, an aspect
of the model’s effectiveness in assigning instances to the
correct categories is measured.
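The five measures named above can all be computed from the four counts of a confusion matrix. The sketch below uses hypothetical labels and predictions (with 1 denoting a case); returning NaN when a denominator is zero is a choice made here for the example.

```python
def confusion_counts(y_true, y_pred):
    """Return (true positives, false positives, true negatives, false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    # Undefined measures (zero denominator) are reported as NaN.
    return numerator / denominator if denominator else float("nan")


def classification_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "precision": _ratio(tp, tp + fp),                  # positive predictive value
        "recall": _ratio(tp, tp + fn),                     # sensitivity
        "specificity": _ratio(tn, tn + fp),
        "negative_predictive_value": _ratio(tn, tn + fn),
    }


# Hypothetical observed outcomes and predictions from some cutoff.
observed = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 1, 1]
metrics = classification_metrics(observed, predicted)
print(metrics)
```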
36.5 Coefficients

After fitting the model, it is likely that researchers will
want to examine the contribution of individual predictors.
To do so, they will want to examine the regression
coefficients. In linear regression, the regression coefficients
represent the change in the criterion for each unit
change in the predictor.[20] In logistic regression, however,
the regression coefficients represent the change in
the logit for each unit change in the predictor. Given that
the logit is not intuitive, researchers are likely to focus on
a predictor’s effect on the exponential function of the
regression coefficient: the odds ratio (see definition). In
linear regression, the significance of a regression coefficient
is assessed by computing a t test. In logistic regression,
there are several different tests designed to assess
the significance of an individual predictor, most notably
the likelihood ratio test and the Wald statistic.


36.5.1 Likelihood ratio test 

The likelihood-ratio test discussed above to assess
model fit is also the recommended procedure to assess
the contribution of individual “predictors” to a given
model.[14][16][20] In the case of a single predictor model,
one simply compares the deviance of the predictor model
with that of the null model on a chi-square distribution
with a single degree of freedom. If the predictor model
has a significantly smaller deviance (cf. chi-square using
the difference in degrees of freedom of the two models),
then one can conclude that there is a significant association
between the “predictor” and the outcome. Although
some common statistical packages (e.g. SPSS) do provide
likelihood ratio test statistics, without this computationally
intensive test it would be more difficult to assess
the contribution of individual predictors in the multiple
logistic regression case. To assess the contribution of
individual predictors one can enter the predictors hierarchically,
comparing each new model with the previous to
determine the contribution of each predictor.[20] There is
some debate among statisticians about the appropriateness
of so-called “stepwise” procedures. The fear is that
they may not preserve nominal statistical properties and
may become misleading.


36.5.2 Wald statistic 

Alternatively, when assessing the contribution of individual
predictors in a given model, one may examine the significance
of the Wald statistic. The Wald statistic, analogous to the
t-test in linear regression, is used to assess the
significance of coefficients. The Wald statistic is the ratio
of the square of the regression coefficient to the square of
the standard error of the coefficient and is asymptotically
distributed as a chi-square distribution.[16]

Although several statistical packages (e.g., SPSS, SAS) report
the Wald statistic to assess the contribution of individual
predictors, the Wald statistic has limitations. When the
regression coefficient is large, the standard error of the
regression coefficient also tends to be large, increasing the
probability of Type-II error. The Wald statistic also tends to
be biased when data are sparse.[20]
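
As a minimal sketch of the definition above, the Wald statistic and its asymptotic p-value can be computed from a coefficient and its standard error. The function name `wald_test` and the numbers passed to it are illustrative assumptions, not values from the text; the p-value again relies on the one-degree-of-freedom chi-square being the square of a standard normal.

```python
import math

def wald_test(coefficient, standard_error):
    """Return (Wald statistic, p-value) for a single coefficient."""
    statistic = (coefficient / standard_error) ** 2
    # Upper tail of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical estimate and standard error.
wald, p_value = wald_test(1.2, 0.5)
print(wald, p_value)
```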

36.5.3 Case-control sampling 

Suppose cases are rare. Then we might wish to sample them more
frequently than their prevalence in the population. For
example, suppose there is a disease that affects 1 person in
10,000 and to collect our data we need to do a complete
physical. It may be too expensive to do thousands of physicals
of healthy people in order to obtain data for only a few
diseased individuals. Thus, we may evaluate more diseased
individuals. This is also called unbalanced data. As a rule of
thumb, sampling controls at a rate of five times the number of
cases will produce sufficient control data.[21]

If we form a logistic model from such data, and the model is
correct, the $\beta_j$ parameters are all correct except for
$\beta_0$. We can correct $\beta_0$ if we know the true
prevalence as follows:[21]

$$\hat\beta_0^* = \hat\beta_0 + \log\frac{\pi}{1-\pi} - \log\frac{\tilde\pi}{1-\tilde\pi}$$

where $\pi$ is the true prevalence and $\tilde\pi$ is the
prevalence in the sample.
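
The intercept correction can be written as a one-line function. The sketch below assumes an intercept estimated on a sample with five controls per case (sample prevalence 1/6) and the 1-in-10,000 population prevalence from the example; the fitted intercept value and the name `corrected_intercept` are hypothetical.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def corrected_intercept(beta0_hat, true_prevalence, sample_prevalence):
    """Shift the intercept from the sample's prevalence to the population's."""
    return beta0_hat + logit(true_prevalence) - logit(sample_prevalence)

beta0_hat = -0.4  # hypothetical intercept fitted on the case-control sample
beta0_star = corrected_intercept(beta0_hat, 1.0 / 10000.0, 1.0 / 6.0)
print(beta0_star)
```

Because cases are over-represented in the sample, the corrected intercept is lower than the fitted one; the other coefficients are left unchanged.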

36.6 Formal mathematical specification

There are various equivalent specifications of logistic re¬ 
gression, which fit into different types of more general 
models. These different specifications allow for different 
sorts of useful generalizations. 

36.6.1 Setup 

The basic setup of logistic regression is the same as for 
standard linear regression. 

It is assumed that we have a series of N observed data points.
Each data point i consists of a set of m explanatory variables
$x_{1,i}, \ldots, x_{m,i}$ (also called independent variables,
predictor variables, input variables, features, or
attributes), and an associated binary-valued outcome variable
$Y_i$ (also known as a dependent variable, response variable,
output variable, outcome variable or class variable), i.e. it
can assume only the two possible values 0 (often meaning “no”
or “failure”) or 1 (often meaning “yes” or “success”). The
goal of logistic regression is to explain the relationship
between the explanatory variables and the outcome, so that an
outcome can be predicted for a new set of explanatory
variables.

Some examples: 

• The observed outcomes are the presence or absence 
of a given disease (e.g. diabetes) in a set of patients, 
and the explanatory variables might be characteris¬ 
tics of the patients thought to be pertinent (sex, race, 
age, blood pressure, body-mass index, etc.). 

• The observed outcomes are the votes (e.g. 
Democratic or Republican) of a set of people in 
an election, and the explanatory variables are the 
demographic characteristics of each person (e.g. 
sex, race, age, income, etc.). In such a case, one of 
the two outcomes is arbitrarily coded as 1, and the 
other as 0. 

As in linear regression, the outcome variables $Y_i$ are
assumed to depend on the explanatory variables
$x_{1,i}, \ldots, x_{m,i}$.

Explanatory variables 

As shown in the examples above, the explanatory variables may
be of any type: real-valued, binary, categorical, etc. The
main distinction is between continuous variables (such as
income, age and blood pressure) and discrete variables (such
as sex or race). Discrete variables referring to more than two
possible choices are typically coded using dummy variables (or
indicator variables), that is, separate explanatory variables
taking the value 0 or 1 are created for each possible value of
the discrete variable, with a 1 meaning “variable does have
the given value” and a 0 meaning “variable does not have that
value”. For example, a four-way discrete variable of blood
type with the possible values “A, B, AB, O” can be converted
to four separate two-way dummy variables, “is-A, is-B, is-AB,
is-O”, where only one of them has the value 1 and all the rest
have the value 0. This allows for separate regression
coefficients to be matched for each possible value of the
discrete variable. (In a case like this, only three of the
four dummy variables are independent of each other, in the
sense that once the values of three of the variables are
known, the fourth is automatically determined. Thus, it is
necessary to encode only three of the four possibilities as
dummy variables. This also means that when all four
possibilities are encoded, the overall model is not
identifiable in the absence of additional constraints such as
a regularization constraint. Theoretically, this could cause
problems, but in reality almost all logistic regression models
are fitted with regularization constraints.)
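
The blood-type coding above can be sketched as follows. The function name `dummy_encode` and the `drop_first` flag are illustrative assumptions; with `drop_first=True` the first level ("A") becomes the reference category and only three indicators are produced.

```python
def dummy_encode(value, levels, drop_first=False):
    """Return 0/1 indicator variables for `value` over the given levels."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    kept = levels[1:] if drop_first else levels
    return [1 if value == level else 0 for level in kept]

blood_types = ["A", "B", "AB", "O"]
full = dummy_encode("AB", blood_types)                      # is-A, is-B, is-AB, is-O
reduced = dummy_encode("AB", blood_types, drop_first=True)  # is-B, is-AB, is-O
print(full, reduced)
```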

Outcome variables 


Formally, the outcomes $Y_i$ are described as being
Bernoulli-distributed data, where each outcome is determined
by an unobserved probability $p_i$ that is specific to the
outcome at hand, but related to the explanatory variables.
This can be expressed in any of the following equivalent
forms:

$$
\begin{aligned}
Y_i \mid x_{1,i}, \ldots, x_{m,i} &\sim \operatorname{Bernoulli}(p_i) \\
\operatorname{E}[Y_i \mid x_{1,i}, \ldots, x_{m,i}] &= p_i \\
\Pr(Y_i = y_i \mid x_{1,i}, \ldots, x_{m,i}) &=
  \begin{cases} p_i & \text{if } y_i = 1 \\ 1 - p_i & \text{if } y_i = 0 \end{cases} \\
\Pr(Y_i = y_i \mid x_{1,i}, \ldots, x_{m,i}) &= p_i^{y_i} (1 - p_i)^{(1 - y_i)}
\end{aligned}
$$

The meanings of these four lines are: 

1. The first line expresses the probability distribution of
each $Y_i$: conditioned on the explanatory variables, it
follows a Bernoulli distribution with parameter $p_i$, the
probability of the outcome of 1 for trial i. As noted above,
each separate trial has its own probability of success, just
as each trial has its own explanatory variables. The
probability of success $p_i$ is not observed, only the outcome
of an individual Bernoulli trial using that probability.

2. The second line expresses the fact that the expected value
of each $Y_i$ is equal to the probability of success $p_i$,
which is a general property of the Bernoulli distribution. In
other words, if we run a large number of Bernoulli trials
using the same probability of success $p_i$, then take the
average of all the 1 and 0 outcomes, the result would be close
to $p_i$. This is because taking an average this way simply
computes the proportion of successes seen, which we expect to
converge to the underlying probability of success.

3. The third line writes out the probability mass function of
the Bernoulli distribution, specifying the probability of
seeing each of the two possible outcomes.

4. The fourth line is another way of writing the probability
mass function, which avoids having to write separate cases and
is more convenient for certain types of calculations. This
relies on the fact that $Y_i$ can take only the value 0 or 1.
In each case, one of the exponents will be 1, “choosing” the
value under it, while the other is 0, “canceling out” the
value under it. Hence, the outcome is either $p_i$ or
$1 - p_i$, as in the previous line.
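
The case-free form in the fourth line is easy to check directly. The sketch below uses a hypothetical helper `bernoulli_pmf`; the exponents pick out either p or 1 - p depending on the outcome.

```python
def bernoulli_pmf(y, p):
    """Probability of outcome y (0 or 1) when the success probability is p."""
    if y not in (0, 1):
        raise ValueError("y must be 0 or 1")
    return (p ** y) * ((1.0 - p) ** (1 - y))

print(bernoulli_pmf(1, 0.3), bernoulli_pmf(0, 0.3))
```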

Linear predictor function 

The basic idea of logistic regression is to use the mechanism
already developed for linear regression by modeling the
probability $p_i$ using a linear predictor function, i.e. a
linear combination of the explanatory variables and a set of
regression coefficients that are specific to the model at hand
but the same for all trials. The linear predictor function
$f(i)$ for a particular data point i is written as:

$$f(i) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i},$$

where $\beta_0, \ldots, \beta_m$ are regression coefficients
indicating the relative effect of a particular explanatory
variable on the outcome.

The model is usually put into a more compact form as follows:

• The regression coefficients $\beta_0, \beta_1, \ldots,
\beta_m$ are grouped into a single vector
$\boldsymbol\beta$ of size m + 1.

• For each data point i, an additional explanatory
pseudo-variable $x_{0,i}$ is added, with a fixed value of 1,
corresponding to the intercept coefficient $\beta_0$.

• The resulting explanatory variables $x_{0,i}, x_{1,i},
\ldots, x_{m,i}$ are then grouped into a single vector
$\mathbf{X}_i$ of size m + 1.

This makes it possible to write the linear predictor function
as follows:

$$f(i) = \boldsymbol\beta \cdot \mathbf{X}_i,$$

using the notation for a dot product between two vectors.
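
A small sketch of the compact form: the pseudo-variable with fixed value 1 is prepended to the explanatory variables, and the linear predictor is then a plain dot product. The function name and the numbers are illustrative assumptions.

```python
def linear_predictor(beta, x):
    """Dot product of beta (size m + 1) with (1, x_1, ..., x_m)."""
    x_augmented = [1.0] + list(x)  # x_0 = 1 pairs with the intercept beta_0
    if len(beta) != len(x_augmented):
        raise ValueError("beta must have one more entry than x")
    return sum(b * v for b, v in zip(beta, x_augmented))

value = linear_predictor([0.5, 2.0, -1.0], [3.0, 4.0])
print(value)
```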


36.6.2 As a generalized linear model 

The particular model used by logistic regression, which
distinguishes it from standard linear regression and from
other types of regression analysis used for binary-valued
outcomes, is the way the probability of a particular outcome
is linked to the linear predictor function:

$$\operatorname{logit}(\operatorname{E}[Y_i \mid x_{1,i}, \ldots, x_{m,i}]) = \operatorname{logit}(p_i) = \ln\frac{p_i}{1-p_i} = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i}$$

Written using the more compact notation described above, this
is:

$$\operatorname{logit}(\operatorname{E}[Y_i \mid \mathbf{X}_i]) = \operatorname{logit}(p_i) = \ln\frac{p_i}{1-p_i} = \boldsymbol\beta \cdot \mathbf{X}_i$$

The intuition for transforming using the logit function (the
natural log of the odds) was explained above. It also has the
practical effect of converting the probability (which is
bounded to be between 0 and 1) to a variable that ranges over
$(-\infty, +\infty)$, thereby matching the potential range of
the linear prediction function on the right side of the
equation.

Note that both the probabilities $p_i$ and the regression
coefficients are unobserved, and the means of determining them
is not part of the model itself. They are typically determined
by some sort of optimization procedure, e.g. maximum
likelihood estimation, that finds values that best fit the
observed data (i.e. that give the most accurate predictions
for the data already observed), usually subject to
regularization conditions that seek to exclude unlikely
values, e.g. extremely large values for any of the regression
coefficients. The use of a regularization condition is
equivalent to doing maximum a posteriori (MAP) estimation, an
extension of maximum likelihood. (Regularization is most
commonly done using a squared regularizing function, which is
equivalent to placing a zero-mean Gaussian prior distribution
on the coefficients, but other regularizers are also
possible.) Whether or not regularization is used, it is
usually not possible to find a closed-form solution; instead,
an iterative numerical method must be used, such as
iteratively reweighted least squares (IRLS) or, more commonly
these days, a quasi-Newton method such as the L-BFGS method.
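
To make the iterative fitting concrete, here is a minimal sketch of MAP estimation with a squared (Gaussian-prior) regularizer. It deliberately swaps in plain gradient ascent instead of IRLS or L-BFGS to stay short; the data, the step size `learning_rate`, the penalty weight `l2`, and the choice to penalize the intercept along with the slope are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_map(X, y, l2=1.0, learning_rate=0.05, steps=2000):
    """Maximize log-likelihood minus (l2 / 2) * ||beta||^2 by gradient ascent.

    Each row of X is expected to start with the constant 1 (intercept term).
    """
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        gradient = [-l2 * b for b in beta]  # gradient of the penalty
        for row, target in zip(X, y):
            p = sigmoid(sum(b * v for b, v in zip(beta, row)))
            for j, v in enumerate(row):
                gradient[j] += (target - p) * v  # gradient of the log-likelihood
        beta = [b + learning_rate * g for b, g in zip(beta, gradient)]
    return beta

# Made-up data: a leading 1 for the intercept, then one predictor.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]]
y = [0, 0, 1, 0, 1, 1]
beta_weak = fit_map(X, y, l2=1.0)
beta_strong = fit_map(X, y, l2=10.0)
print(beta_weak, beta_strong)
```

A larger `l2` corresponds to a tighter zero-mean prior, so the fitted coefficient vector is pulled closer to zero.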

The interpretation of the $\beta_j$ parameter estimates is as
the additive effect on the log of the odds for a unit change
in the jth explanatory variable. In the case of a dichotomous
explanatory variable, for instance gender, $e^{\beta}$ is the
estimate of the odds of having the outcome for, say, males
compared with females.
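
The additive-on-log-odds reading can be verified numerically: raising a predictor by one unit multiplies the odds by the exponential of its coefficient, whatever the starting value. The coefficients below are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

beta0, beta1 = -1.0, 0.7  # hypothetical intercept and slope
x = 2.0
odds_before = odds(sigmoid(beta0 + beta1 * x))
odds_after = odds(sigmoid(beta0 + beta1 * (x + 1.0)))
odds_ratio = odds_after / odds_before
print(odds_ratio, math.exp(beta1))
```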

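An equivalent formula uses the inverse of the logit function,
which is the logistic function, i.e.:

$$\operatorname{E}[Y_i \mid \mathbf{X}_i] = p_i = \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) = \frac{1}{1 + e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}$$

The formula can also be written as a probability distribution
(specifically, using a probability mass function):

$$\Pr(Y_i = y_i \mid \mathbf{X}_i) = p_i^{y_i} (1 - p_i)^{1 - y_i} = \left(\frac{e^{\boldsymbol\beta \cdot \mathbf{X}_i}}{1 + e^{\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{y_i} \left(1 - \frac{e^{\boldsymbol\beta \cdot \mathbf{X}_i}}{1 + e^{\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{1 - y_i}$$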

The logit formulation,
$\operatorname{logit}(\operatorname{E}[Y_i \mid \mathbf{X}_i]) = \operatorname{logit}(p_i) = \ln\frac{p_i}{1-p_i} = \boldsymbol\beta \cdot \mathbf{X}_i$,
expresses logistic regression as a type of generalized linear
model, which predicts variables with various types of
probability distributions by fitting a linear predictor
function of the above form to some sort of arbitrary
transformation of the expected value of the variable.


36.6.3 As a latent-variable model 

The above model has an equivalent formulation as a
latent-variable model. This formulation is common in the
theory of discrete choice models, and makes it easier to
extend to certain more complicated models with multiple,
correlated choices, as well as to compare logistic regression
to the closely related probit model.

Imagine that, for each trial i, there is a continuous latent
variable $Y_i^*$ (i.e. an unobserved random variable) that is
distributed as follows:

$$Y_i^* = \boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon$$

where

$$\varepsilon \sim \operatorname{Logistic}(0, 1),$$

i.e. the latent variable can be written directly in terms of
the linear predictor function and an additive random error
variable that is distributed according to a standard logistic
distribution.

Then $Y_i$ can be viewed as an indicator for whether this
latent variable is positive:

$$Y_i = \begin{cases} 1 & \text{if } Y_i^* > 0, \text{ i.e. } -\varepsilon < \boldsymbol\beta \cdot \mathbf{X}_i, \\ 0 & \text{otherwise.} \end{cases}$$


The choice of modeling the error variable specifically with a
standard logistic distribution, rather than a general logistic
distribution with the location and scale set to arbitrary
values, seems restrictive, but in fact it is not. It must be
kept in mind that we can choose the regression coefficients
ourselves, and very often can use them to offset changes in
the parameters of the error variable’s distribution. For
example, a logistic error-variable distribution with a
non-zero location parameter $\mu$ (which sets the mean) is
equivalent to a distribution with a zero location parameter,
where $\mu$ has been added to the intercept coefficient. Both
situations produce the same value for $Y_i^*$ regardless of
settings of explanatory variables. Similarly, an arbitrary
scale parameter s is equivalent to setting the scale parameter
to 1 and then dividing all regression coefficients by s. In
the latter case, the resulting value of $Y_i^*$ will be
smaller by a factor of s than in the former case, for all sets
of explanatory variables, but critically, it will always
remain on the same side of 0, and hence lead to the same $Y_i$
choice.

(Note that this predicts that the irrelevancy of the scale
parameter may not carry over into more complex models where
more than two choices are available.)

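It turns out that this formulation is exactly equivalent to
the preceding one, phrased in terms of the generalized linear
model and without any latent variables. This can be shown as
follows, using the fact that the cumulative distribution
function (CDF) of the standard logistic distribution is the
logistic function, which is the inverse of the logit function,
i.e.

$$\Pr(\varepsilon < x) = \operatorname{logit}^{-1}(x)$$

Then:

$$
\begin{aligned}
\Pr(Y_i = 1 \mid \mathbf{X}_i) &= \Pr(Y_i^* > 0 \mid \mathbf{X}_i) \\
&= \Pr(\boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon > 0) \\
&= \Pr(\varepsilon > -\boldsymbol\beta \cdot \mathbf{X}_i) \\
&= \Pr(\varepsilon < \boldsymbol\beta \cdot \mathbf{X}_i) && \text{(because the logistic distribution is symmetric)} \\
&= \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) \\
&= p_i && \text{(see above)}
\end{aligned}
$$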
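
The equivalence can also be checked by simulation. The sketch below draws standard logistic errors by inverting the CDF, thresholds the latent variable at zero, and compares the observed frequency of 1 outcomes with the logistic function of the linear predictor. The value 0.8 for the linear predictor, the sample size, and the function names are assumptions for illustration only.

```python
import math
import random

def sample_standard_logistic(rng):
    """Inverse-CDF draw: if U is uniform on (0, 1), logit(U) is standard logistic."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return math.log(u / (1.0 - u))

def simulate_latent_frequency(linear_predictor, n, seed=0):
    """Fraction of trials in which the latent variable Y* is positive."""
    rng = random.Random(seed)
    positives = 0
    for _ in range(n):
        latent = linear_predictor + sample_standard_logistic(rng)
        if latent > 0:
            positives += 1
    return positives / n

frequency = simulate_latent_frequency(0.8, 100000, seed=1)
probability = 1.0 / (1.0 + math.exp(-0.8))
print(frequency, probability)
```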

This formulation—which is standard in discrete choice 
models—makes clear the relationship between logistic 
regression (the “logit model”) and the probit model, 
which uses an error variable distributed according to a 
standard normal distribution instead of a standard logis¬ 
tic distribution. Both the logistic and normal distributions 
are symmetric with a basic unimodal, “bell curve” shape. 

The only difference is that the logistic distribution has 
somewhat heavier tails, which means that it is less sensi¬ 
tive to outlying data (and hence somewhat more robust to 
model mis-specifications or erroneous data). 


36.6.4 As a two-way latent-variable model 

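Yet another formulation uses two separate latent variables:

$$
\begin{aligned}
Y_i^{0*} &= \boldsymbol\beta_0 \cdot \mathbf{X}_i + \varepsilon_0 \\
Y_i^{1*} &= \boldsymbol\beta_1 \cdot \mathbf{X}_i + \varepsilon_1
\end{aligned}
$$

where

$$
\begin{aligned}
\varepsilon_0 &\sim \operatorname{EV}_1(0, 1) \\
\varepsilon_1 &\sim \operatorname{EV}_1(0, 1)
\end{aligned}
$$

where $\operatorname{EV}_1(0, 1)$ is a standard type-1 extreme
value distribution: i.e.

$$\Pr(\varepsilon_0 = x) = \Pr(\varepsilon_1 = x) = e^{-x} e^{-e^{-x}}$$

Then

$$Y_i = \begin{cases} 1 & \text{if } Y_i^{1*} > Y_i^{0*}, \\ 0 & \text{otherwise.} \end{cases}$$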

This model has a separate latent variable and a separate set
of regression coefficients for each possible outcome of the
dependent variable. The reason for this separation is that it
makes it easy to extend logistic regression to multi-outcome
categorical variables, as in the multinomial logit model. In
such a model, it is natural to model each possible outcome
using a different set of regression coefficients. It is also
possible to motivate each of the separate latent variables as
the theoretical utility associated with making the associated
choice, and thus motivate logistic regression in terms of
utility theory. (In terms of utility theory, a rational actor
always chooses the choice with the greatest associated
utility.) This is the approach taken by economists when
formulating discrete choice models, because it both provides a
theoretically strong foundation and facilitates intuitions
about the model, which in turn makes it easy to consider
various sorts of extensions. (See the example below.)

The choice of the type-1 extreme value distribution seems
fairly arbitrary, but it makes the mathematics work out, and
it may be possible to justify its use through rational choice
theory.

It turns out that this model is equivalent to the previous
model, although this seems non-obvious, since there are now
two sets of regression coefficients and error variables, and
the error variables have a different distribution. In fact,
this model reduces directly to the previous one with the
following substitutions:

$$
\begin{aligned}
\boldsymbol\beta &= \boldsymbol\beta_1 - \boldsymbol\beta_0 \\
\varepsilon &= \varepsilon_1 - \varepsilon_0
\end{aligned}
$$

An intuition for this comes from the fact that, since we
choose based on the maximum of two values, only their
difference matters, not the exact values, and this effectively
removes one degree of freedom. Another critical fact is that
the difference of two type-1 extreme-value-distributed
variables is a logistic distribution, i.e.
$\varepsilon = \varepsilon_1 - \varepsilon_0 \sim \operatorname{Logistic}(0, 1)$.
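
The claim about the difference of two type-1 extreme value variables can be checked empirically. The sketch below draws standard Gumbel (type-1 extreme value) variates by inverting their CDF and compares the empirical CDF of the difference with the logistic function. The evaluation point 0.5, the sample size, and the helper names are assumptions.

```python
import math
import random

def sample_gumbel(rng):
    """Standard type-1 extreme value draw via the inverse CDF exp(-exp(-x))."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def empirical_cdf_of_difference(x, n, seed=0):
    """Fraction of draws with eps_1 - eps_0 < x."""
    rng = random.Random(seed)
    below = 0
    for _ in range(n):
        eps_1 = sample_gumbel(rng)
        eps_0 = sample_gumbel(rng)
        if eps_1 - eps_0 < x:
            below += 1
    return below / n

empirical = empirical_cdf_of_difference(0.5, 100000, seed=2)
logistic_cdf = 1.0 / (1.0 + math.exp(-0.5))
print(empirical, logistic_cdf)
```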

We can demonstrate the equivalence as follows:

$$
\begin{aligned}
\Pr(Y_i = 1 \mid \mathbf{X}_i) &= \Pr(Y_i^{1*} > Y_i^{0*} \mid \mathbf{X}_i) \\
&= \Pr(Y_i^{1*} - Y_i^{0*} > 0 \mid \mathbf{X}_i) \\
&= \Pr(\boldsymbol\beta_1 \cdot \mathbf{X}_i + \varepsilon_1 - (\boldsymbol\beta_0 \cdot \mathbf{X}_i + \varepsilon_0) > 0) \\
&= \Pr((\boldsymbol\beta_1 \cdot \mathbf{X}_i - \boldsymbol\beta_0 \cdot \mathbf{X}_i) + (\varepsilon_1 - \varepsilon_0) > 0) \\
&= \Pr((\boldsymbol\beta_1 - \boldsymbol\beta_0) \cdot \mathbf{X}_i + (\varepsilon_1 - \varepsilon_0) > 0) \\
&= \Pr((\boldsymbol\beta_1 - \boldsymbol\beta_0) \cdot \mathbf{X}_i + \varepsilon > 0) && \text{(substitute } \varepsilon \text{ as above)} \\
&= \Pr(\boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon > 0) && \text{(substitute } \boldsymbol\beta \text{ as above)} \\
&= \Pr(\varepsilon > -\boldsymbol\beta \cdot \mathbf{X}_i) && \text{(now, same as above model)} \\
&= \Pr(\varepsilon < \boldsymbol\beta \cdot \mathbf{X}_i) \\
&= \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) \\
&= p_i
\end{aligned}
$$

Example

As an example, consider a province-level election where the
choice is between a right-of-center party, a left-of-center
party, and a secessionist party (e.g. the Parti Quebecois,
which wants Quebec to secede from Canada).

We would then use three latent variables, one for each choice.
Then, in accordance with utility theory, we can interpret the
latent variables as expressing the utility that results from
making each of the choices. We can also interpret the
regression coefficients as indicating the strength that the
associated factor (i.e. explanatory variable) has in
contributing to the utility, or more correctly, the amount by
which a unit change in an explanatory variable changes the
utility of a given choice. A voter might expect that the
right-of-center party would lower taxes, especially on rich
people. This would give low-income people no benefit, i.e. no
change in utility (since they usually don't pay taxes); would
cause moderate benefit (i.e. somewhat more money, or moderate
utility increase) for middle-income people; and would cause
significant benefits for high-income people. On the other
hand, the left-of-center party might be expected to raise
taxes and offset it with increased welfare and other
assistance for the lower and middle classes. This would cause
significant positive benefit to low-income people, perhaps
weak benefit to middle-income people, and significant negative
benefit to high-income people. Finally, the secessionist party
would take no direct actions on the economy, but simply
secede. A low-income or middle-income voter might expect
basically no clear utility gain or loss from this, but a
high-income voter might expect negative utility, since he/she
is likely to own companies, which will have a harder time
doing business in such an environment and probably lose money.

These intuitions can be expressed as follows:

                 Right-of-center   Left-of-center   Secessionist
High-income      strong +          strong -         negative
Middle-income    moderate +        weak +           none
Low-income       none              strong +         none

This clearly shows that

1. Separate sets of regression coefficients need to exist for
each choice. When phrased in terms of utility, this can be
seen very easily. Different choices have different effects on
net utility; furthermore, the effects vary in complex ways
that depend on the characteristics of each individual, so
there need to be separate sets of coefficients for each
characteristic, not simply a single extra per-choice
characteristic.

2. Even though income is a continuous variable, its effect on
utility is too complex for it to be treated as a single
variable. Either it needs to be directly split up into ranges,
or higher powers of income need to be added so that polynomial
regression on income is effectively done.


36.6.5 As a “log-linear” model


Yet another formulation combines the two-way latent variable
formulation above with the original formulation higher up
without latent variables, and in the process provides a link
to one of the standard formulations of the multinomial logit.

Here, instead of writing the logit of the probabilities $p_i$
as a linear predictor, we separate the linear predictor into
two, one for each of the two outcomes:

$$
\begin{aligned}
\ln \Pr(Y_i = 0) &= \boldsymbol\beta_0 \cdot \mathbf{X}_i - \ln Z \\
\ln \Pr(Y_i = 1) &= \boldsymbol\beta_1 \cdot \mathbf{X}_i - \ln Z
\end{aligned}
$$

Note that two separate sets of regression coefficients have
been introduced, just as in the two-way latent variable model,
and the two equations appear in a form that writes the
logarithm of the associated probability as a linear predictor,
with an extra term $-\ln Z$ at the end. This term, as it turns
out, serves as the normalizing factor ensuring that the result
is a distribution. This can be seen by exponentiating both
sides:

$$
\begin{aligned}
\Pr(Y_i = 0) &= \frac{1}{Z} e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} \\
\Pr(Y_i = 1) &= \frac{1}{Z} e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}
\end{aligned}
$$


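In this form it is clear that the purpose of Z is to ensure
that the resulting distribution over $Y_i$ is in fact a
probability distribution, i.e. it sums to 1. This means that Z
is simply the sum of all un-normalized probabilities, and by
dividing each probability by Z, the probabilities become
"normalized". That is:

$$Z = e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}$$

and the resulting equations are

$$
\begin{aligned}
\Pr(Y_i = 0) &= \frac{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} \\
\Pr(Y_i = 1) &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}
\end{aligned}
$$

Or generally:

$$\Pr(Y_i = c) = \frac{e^{\boldsymbol\beta_c \cdot \mathbf{X}_i}}{\sum_h e^{\boldsymbol\beta_h \cdot \mathbf{X}_i}}$$

This shows clearly how to generalize this formulation to more
than two outcomes, as in multinomial logit. Note that this
general formulation is exactly the softmax function, as in

$$\Pr(Y_i = c) = \operatorname{softmax}(c, \boldsymbol\beta_0 \cdot \mathbf{X}_i, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \ldots).$$

In order to prove that this is equivalent to the previous
model, note that the above model is overspecified, in that
$\Pr(Y_i = 0)$ and $\Pr(Y_i = 1)$ cannot be independently
specified: rather $\Pr(Y_i = 0) + \Pr(Y_i = 1) = 1$, so
knowing one automatically determines the other. As a result,
the model is nonidentifiable, in that multiple combinations of
$\boldsymbol\beta_0$ and $\boldsymbol\beta_1$ will produce the
same probabilities for all possible explanatory variables. In
fact, it can be seen that adding any constant vector
$\mathbf{C}$ to both of them will produce the same
probabilities:

$$
\begin{aligned}
\Pr(Y_i = 1) &= \frac{e^{(\boldsymbol\beta_1 + \mathbf{C}) \cdot \mathbf{X}_i}}{e^{(\boldsymbol\beta_0 + \mathbf{C}) \cdot \mathbf{X}_i} + e^{(\boldsymbol\beta_1 + \mathbf{C}) \cdot \mathbf{X}_i}} \\
&= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}} \\
&= \frac{e^{\mathbf{C} \cdot \mathbf{X}_i} e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\mathbf{C} \cdot \mathbf{X}_i} \left(e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}\right)} \\
&= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}
\end{aligned}
$$

As a result, we can simplify matters, and restore
identifiability, by picking an arbitrary value for one of the
two vectors. We choose to set $\boldsymbol\beta_0 = \mathbf{0}$.
Then,

$$e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} = e^{\mathbf{0} \cdot \mathbf{X}_i} = 1$$

and so

$$\Pr(Y_i = 1) = \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{1 + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} = \frac{1}{1 + e^{-\boldsymbol\beta_1 \cdot \mathbf{X}_i}} = p_i$$

which shows that this formulation is indeed equivalent to the
previous formulation. (As in the two-way latent variable
formulation, any settings where
$\boldsymbol\beta = \boldsymbol\beta_1 - \boldsymbol\beta_0$
will produce equivalent results.)

Note that most treatments of the multinomial logit model start
out either by extending the “log-linear” formulation presented
here or the two-way latent variable formulation presented
above, since both clearly show the way that the model could be
extended to multi-way outcomes. In general, the presentation
with latent variables is more common in econometrics and
political science, where discrete choice models and utility
theory reign, while the “log-linear” formulation here is more
common in computer science, e.g. machine learning and natural
language processing.


36.6.6 As a single-layer perceptron

The model has an equivalent formulation

$$p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}}.$$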

This functional form is commonly called a single-layer
perceptron or single-layer artificial neural network. A
single-layer neural network computes a continuous output
instead of a step function. The derivative of $p_i$ with
respect to $X = (x_1, \ldots, x_k)$ is computed from the
general form:

$$y = \frac{1}{1 + e^{-f(X)}}$$

where $f(X)$ is an analytic function in X. With this choice,
the single-layer neural network is identical to the logistic
regression model. This function has a continuous derivative,
which allows it to be used in backpropagation. This function
is also preferred because its derivative is easily calculated:

$$\frac{\mathrm{d}y}{\mathrm{d}X} = y(1 - y)\frac{\mathrm{d}f}{\mathrm{d}X}.$$
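
The derivative formula can be checked against finite differences. The sketch below assumes a linear f, so that the derivative of f with respect to each input is just the matching coefficient; the coefficient values, the input point, and the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predicted_probability(beta0, beta, x):
    return sigmoid(beta0 + sum(b * v for b, v in zip(beta, x)))

def gradient_wrt_inputs(beta0, beta, x):
    """dy/dx_j = y * (1 - y) * df/dx_j, with df/dx_j = beta_j for linear f."""
    y = predicted_probability(beta0, beta, x)
    return [y * (1.0 - y) * b for b in beta]

beta0, beta = 0.3, [1.5, -0.8]
x = [0.2, 1.1]
analytic = gradient_wrt_inputs(beta0, beta, x)

# Central finite differences as an independent check.
step = 1e-6
numeric = []
for j in range(len(x)):
    up = list(x)
    down = list(x)
    up[j] += step
    down[j] -= step
    difference = predicted_probability(beta0, beta, up) - predicted_probability(beta0, beta, down)
    numeric.append(difference / (2.0 * step))
print(analytic, numeric)
```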



36.6.7 In terms of binomial data 

A closely related model assumes that each i is associated not
with a single Bernoulli trial but with $n_i$ independent
identically distributed trials, where the observation $Y_i$ is
the number of successes observed (the sum of the individual
Bernoulli-distributed random variables), and hence follows a
binomial distribution:

$$Y_i \sim \operatorname{Bin}(n_i, p_i), \text{ for } i = 1, \ldots, n$$

An example of this distribution is the fraction of seeds
($p_i$) that germinate after $n_i$ are planted.

In terms of expected values, this model is expressed as
follows:

$$p_i = \operatorname{E}\left[\left.\frac{Y_i}{n_i} \,\right|\, \mathbf{X}_i\right],$$

so that

$$\operatorname{logit}\left(\operatorname{E}\left[\left.\frac{Y_i}{n_i} \,\right|\, \mathbf{X}_i\right]\right) = \operatorname{logit}(p_i) = \ln\frac{p_i}{1 - p_i} = \boldsymbol\beta \cdot \mathbf{X}_i,$$

Or equivalently:

$$\Pr(Y_i = y_i \mid \mathbf{X}_i) = \binom{n_i}{y_i} p_i^{y_i} (1 - p_i)^{n_i - y_i} = \binom{n_i}{y_i} \left(\frac{1}{1 + e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{y_i} \left(1 - \frac{1}{1 + e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{n_i - y_i}$$

This model can be fit using the same sorts of methods as the
above more basic model.


36.7 Bayesian logistic regression

In a Bayesian statistics context, prior distributions are
normally placed on the regression coefficients, usually in the
form of Gaussian distributions. Unfortunately, the Gaussian
distribution is not the conjugate prior of the likelihood
function in logistic regression. As a result, the posterior
distribution is difficult to calculate, even using standard
simulation algorithms (e.g. Gibbs sampling).

[Figure: Comparison of logistic function with a scaled inverse
probit function (i.e. the CDF of the normal distribution),
comparing $\sigma(x)$ vs. $\Phi(\sqrt{\pi/8}\, x)$, which makes
the slopes the same at the origin. This shows the heavier tails
of the logistic distribution.]

There are various possibilities:

• Don't do a proper Bayesian analysis, but simply compute a
maximum a posteriori point estimate of the parameters. This is
common, for example, in “maximum entropy” classifiers in
machine learning.

• Use a more general approximation method such as the
Metropolis-Hastings algorithm.

• Draw a Markov chain Monte Carlo sample from the exact
posterior by using the Independent Metropolis-Hastings
algorithm with heavy-tailed multivariate candidate
distribution found by matching the mode and curvature at the
mode of the normal approximation to the posterior and then
using the Student’s t shape with low degrees of freedom.[22]
This is shown to have excellent convergence properties.

• Use a latent variable model and approximate the logistic
distribution using a more tractable distribution, e.g. a
Student’s t-distribution or a mixture of normal distributions.

• Do probit regression instead of logistic regression. This is
actually a special case of the previous situation, using a
normal distribution in place of a Student’s t, mixture of
normals, etc. This will be less accurate but has the advantage
that probit regression is extremely common, and a ready-made
Bayesian implementation may already be available.

• Use the Laplace approximation of the posterior
distribution.[23] This approximates the posterior with a
Gaussian distribution. This is not a terribly good
approximation, but it suffices if all that is desired is an
estimate of the posterior mean and variance. In such a case,
an approximation scheme such as variational Bayes can be
used.[24]
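
As a rough illustration of the Metropolis-Hastings option in the list above, the sketch below runs a random-walk Metropolis sampler (the symmetric-proposal special case) on the posterior of a two-coefficient logistic model with an independent zero-mean Gaussian prior. The data, the prior standard deviation `prior_sd`, the proposal scale `step`, and the function names are all assumptions; this is not the independence sampler with a Student's t candidate described in the third bullet.

```python
import math
import random

def log_posterior(beta, X, y, prior_sd=2.0):
    """Unnormalized log posterior: Gaussian prior plus Bernoulli log-likelihood."""
    value = -sum(b * b for b in beta) / (2.0 * prior_sd ** 2)
    for row, target in zip(X, y):
        z = sum(b * v for b, v in zip(beta, row))
        # target * z - log(1 + exp(z)), written to avoid overflow
        value += target * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return value

def metropolis(X, y, n_samples=4000, step=0.5, seed=0):
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    current = log_posterior(beta, X, y)
    samples = []
    for _ in range(n_samples):
        proposal = [b + rng.gauss(0.0, step) for b in beta]
        candidate = log_posterior(proposal, X, y)
        # Accept with probability min(1, posterior ratio).
        if math.log(1.0 - rng.random()) < candidate - current:
            beta, current = proposal, candidate
        samples.append(beta)
    return samples

# Made-up data: a leading 1 for the intercept, then one predictor.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]]
y = [0, 0, 1, 0, 1, 1]
samples = metropolis(X, y)
kept = samples[1000:]  # discard an arbitrary burn-in period
posterior_mean_slope = sum(s[1] for s in kept) / len(kept)
print(posterior_mean_slope)
```

In practice the proposal scale and burn-in length would need tuning and convergence checks, which this sketch omits.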

36.7.1 Gibbs sampling with an approxi¬ 
mating distribution 

As shown above, logistic regression is equivalent to a latent
variable model with an error variable distributed according to
a standard logistic distribution. The overall distribution of
the latent variable $Y_i^*$ is also a logistic distribution,
with the mean equal to $\boldsymbol\beta \cdot \mathbf{X}_i$
(i.e. the fixed quantity added to the error variable). This
model considerably simplifies the application of techniques
such as Gibbs sampling. However, sampling the regression
coefficients is still difficult, because of the lack of
conjugacy between the normal and logistic distributions.
Changing the prior distribution over the regression
coefficients is of no help, because the logistic distribution
is not in the exponential family and thus has no conjugate
prior.

One possibility is to use a more general Markov chain 
Monte Carlo technique, such as the Metropolis-Hastings 
algorithm, which can sample arbitrary distributions. An¬ 
other possibility, however, is to replace the logistic dis¬ 
tribution with a similar-shaped distribution that is easier 
to work with using Gibbs sampling. In fact, the logistic 
and normal distributions have a similar shape, and thus 
one possibility is simply to have normally distributed er¬ 
rors. Because the normal distribution is conjugate to it¬ 
self, sampling the regression coefficients becomes easy. 
In fact, this model is exactly the model used in probit re¬ 
gression. 

However, the normal and logistic distributions differ in 
that the logistic has heavier tails. As a result, it is more 
robust to inaccuracies in the underlying model (which are 
inevitable, in that the model is essentially always an ap¬ 
proximation) or to errors in the data. Probit regression 
loses some of this robustness. 

Another alternative is to use errors distributed as a
Student’s t-distribution. The Student’s t-distribution has
heavy tails, and is easy to sample from because it is the
compound distribution of a normal distribution with variance
distributed as an inverse gamma distribution. In other words,
if a normal distribution is used for the error variable, and
another latent variable, following an inverse gamma
distribution, is added corresponding to the variance of this
error variable, the marginal distribution of the error
variable will follow a Student’s t distribution. Because of
the various conjugacy relationships, all variables in this
model are easy to sample from.

The Student’s t distribution that best approximates a standard
logistic distribution can be determined by matching the
moments of the two distributions. The Student’s t distribution
has three parameters, and since the skewness of both
distributions is always 0, the first four moments can all be
matched, using the following equations:

$$
\begin{aligned}
\mu &= 0 \\
\frac{\nu}{\nu - 2} s^2 &= \frac{\pi^2}{3} \\
\frac{6}{\nu - 4} &= \frac{6}{5}
\end{aligned}
$$

This yields the following values:

$$
\begin{aligned}
\mu &= 0 \\
s &= \sqrt{\frac{7}{9} \frac{\pi^2}{3}} \\
\nu &= 9
\end{aligned}
$$

The following graphs compare the standard logistic dis¬ 
tribution with the Student’s t distribution that matches the 
first four moments using the above-determined values, as 
well as the normal distribution that matches the first two 
moments. Note how much closer the Student’s t distri¬ 
bution agrees, especially in the tails. Beyond about two 
standard deviations from the mean, the logistic and nor¬ 
mal distributions diverge rapidly, but the logistic and Stu¬ 
dent’s t distributions don't start diverging significantly un¬ 
til more than 5 standard deviations away. 

(Another possibility, also amenable to Gibbs sampling, is 
to approximate the logistic distribution using a mixture 
density of normal distributions.) 

36.8 Extensions 

There are large numbers of extensions: 

• Multinomial logistic regression (or multinomial 
logit) handles the case of a multi-way categorical de¬ 
pendent variable (with unordered values, also called 
“classification”). Note that the general case of hav¬ 
ing dependent variables with more than two values 
is termed polytomous regression. 

• Ordered logistic regression (or ordered logit) han¬ 
dles ordinal dependent variables (ordered values). 

• Mixed logit is an extension of multinomial logit that 
allows for correlations among the choices of the de¬ 
pendent variable. 

• An extension of the logistic model to sets of inter¬ 
dependent variables is the conditional random field. 

36.9 Model suitability

A way to measure a model’s suitability is to assess the 
model against a set of data that was not used to create 
the model.[25] The class of techniques is called
cross-validation. This holdout model assessment method is
particularly valuable when data are collected in different
settings (e.g., at different times or places) or when models
are assumed to be generalizable. 

To measure the suitability of a binary regression model, 
one can classify both the actual value and the predicted 
value of each observation as either 0 or 1.[26] The
predicted value of an observation can be set equal to 1 if
the estimated probability that the observation equals 1 is
above 1/2, and set equal to 0 if the estimated probability
is below 1/2. Here logistic regression is being used as a
binary classification model. There are four possible
combined classifications:

1. prediction of 0 when the holdout sample has a 0 
(True Negatives, the number of which is TN) 

2. prediction of 0 when the holdout sample has a 1 
(False Negatives, the number of which is FN) 

3. prediction of 1 when the holdout sample has a 0 
(False Positives, the number of which is FP) 

4. prediction of 1 when the holdout sample has a 1 
(True Positives, the number of which is TP) 

These classifications are used to calculate accuracy, pre¬ 
cision (also called positive predictive value), recall (also 
called sensitivity), specificity and negative predictive 
value: 

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

$$\text{Precision} = \text{Positive predictive value} = \frac{TP}{TP + FP}$$

$$\text{Negative predictive value} = \frac{TN}{TN + FN}$$

$$\text{Recall} = \text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$
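
The thresholding step and the five measures above can be sketched as follows. The function names are illustrative assumptions, and treating an estimated probability of exactly 1/2 as a prediction of 0 is a choice made here, since the text leaves that case open.

```python
def confusion_counts(actual, probabilities, threshold=0.5):
    """Count TP, FP, FN, TN for holdout labels and estimated probabilities."""
    tp = fp = fn = tn = 0
    for y, p in zip(actual, probabilities):
        predicted = 1 if p > threshold else 0  # ties go to 0 (assumption)
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(tp, fp, fn, tn):
    """Compute the measures defined above; denominators must be non-zero."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

For example, `classification_metrics(*confusion_counts(labels, probs))` summarises a holdout sample in one call.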

36.10 See also 

• Logistic function 

• Discrete choice 

• Jarrow-Turnbull model 

• Limited dependent variable 


• Multinomial logit model 

• Ordered logit 

• Hosmer-Lemeshow test 

• Brier score 

• MLPACK - contains a C++ implementation of lo¬ 
gistic regression 

• Local case-control sampling 

Chapter 37

Linear discriminant analysis 


Not to be confused with latent Dirichlet allocation. 


Linear discriminant analysis (LDA) is a generaliza¬ 
tion of Fisher’s linear discriminant, a method used in 
statistics, pattern recognition and machine learning to find 
a linear combination of features that characterizes or sep¬ 
arates two or more classes of objects or events. The re¬ 
sulting combination may be used as a linear classifier, 
or, more commonly, for dimensionality reduction before 
later classification. 

LDA is closely related to analysis of variance (ANOVA) 
and regression analysis, which also attempt to express 
one dependent variable as a linear combination of 
other features or measurements.[1][2] However, ANOVA
uses categorical independent variables and a continuous 
dependent variable, whereas discriminant analysis has 
continuous independent variables and a categorical
dependent variable (i.e. the class label).[3] Logistic
regression and probit regression are more similar to LDA than
ANOVA is, as they also explain a categorical variable by 
the values of continuous independent variables. These 
other methods are preferable in applications where it is 
not reasonable to assume that the independent variables 
are normally distributed, which is a fundamental assump¬ 
tion of the LDA method. 

LDA is also closely related to principal component anal¬ 
ysis (PCA) and factor analysis in that they both look for 
linear combinations of variables which best explain the 
data.[4] LDA explicitly attempts to model the difference
between the classes of data. PCA on the other hand does 
not take into account any difference in class, and factor 
analysis builds the feature combinations based on differ¬ 
ences rather than similarities. Discriminant analysis is 
also different from factor analysis in that it is not an in¬ 
terdependence technique: a distinction between indepen¬ 
dent variables and dependent variables (also called crite¬ 
rion variables) must be made. 

LDA works when the measurements made on independent
variables for each observation are continuous quantities.
When dealing with categorical independent variables, the
equivalent technique is discriminant correspondence
analysis.[5][6]


37.1 LDA for two classes 

Consider a set of observations x (also called features, at¬ 
tributes, variables or measurements) for each sample of 
an object or event with known class y. This set of samples 
is called the training set. The classification problem is 
then to find a good predictor for the class y of any sample 
of the same distribution (not necessarily from the training 
set) given only an observation x.[7]:338

LDA approaches the problem by assuming that the conditional
probability density functions $p(x \mid y = 0)$ and
$p(x \mid y = 1)$ are both normally distributed with mean and
covariance parameters $(\mu_0, \Sigma_0)$ and $(\mu_1, \Sigma_1)$,
respectively. Under this assumption, the Bayes optimal solution
is to predict points as being from the second class if the
log of the likelihood ratio is above some threshold T, so
that:


$$(x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) + \ln|\Sigma_0| - (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) - \ln|\Sigma_1| > T$$


Without any further assumptions, the resulting classifier 
is referred to as QDA (quadratic discriminant analysis). 

LDA instead makes the additional simplifying
homoscedasticity assumption (i.e. that the class
covariances are identical, so $\Sigma_0 = \Sigma_1 = \Sigma$) and that the
covariances have full rank. In this case, several terms
cancel:

$$x^T \Sigma_0^{-1} x = x^T \Sigma_1^{-1} x$$

$$x^T \Sigma_i^{-1} \mu_i = \mu_i^T \Sigma_i^{-1} x \quad \text{because } \Sigma_i \text{ is Hermitian}$$

and the above decision criterion becomes a threshold on
the dot product

$$w \cdot x > c$$

for some threshold constant c, where

$$w = \Sigma^{-1} (\mu_1 - \mu_0)$$

$$c = \frac{1}{2} (T - \mu_0^T \Sigma_0^{-1} \mu_0 + \mu_1^T \Sigma_1^{-1} \mu_1)$$

This means that the criterion of an input x being in a class 
y is purely a function of this linear combination of the 
known observations. 

It is often useful to see this conclusion in geometrical 
terms: the criterion of an input x being in a class y is 
purely a function of projection of multidimensional-space 
point x onto vector w (thus, we only consider its direc¬ 
tion). In other words, the observation belongs to y if cor¬ 
responding x is located on a certain side of a hyperplane 
perpendicular to w . The location of the plane is defined 
by the threshold c. 
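
The two-class rule can be sketched in a few lines for two-dimensional data. This is an illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, the shared covariance is estimated by pooling both classes with n0 + n1 - 2 degrees of freedom, and the threshold T defaults to 0.

```python
def mean_vector(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def pooled_covariance(class0, class1, mu0, mu1):
    """Shared 2x2 covariance estimated from both classes (assumption)."""
    sxx = sxy = syy = 0.0
    for points, mu in ((class0, mu0), (class1, mu1)):
        for x, y in points:
            dx, dy = x - mu[0], y - mu[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    dof = len(class0) + len(class1) - 2
    return [[sxx / dof, sxy / dof], [sxy / dof, syy / dof]]

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]  # must be non-zero (full rank)
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def quad(u, m, v):
    """Bilinear form u^T m v for 2-vectors."""
    return sum(u[i] * m[i][j] * v[j] for i in range(2) for j in range(2))

def fit_lda(class0, class1, T=0.0):
    """Return (w, c) with w = Sigma^-1 (mu1 - mu0) and the threshold c."""
    mu0, mu1 = mean_vector(class0), mean_vector(class1)
    sigma_inv = inverse_2x2(pooled_covariance(class0, class1, mu0, mu1))
    diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    w = [sigma_inv[0][0] * diff[0] + sigma_inv[0][1] * diff[1],
         sigma_inv[1][0] * diff[0] + sigma_inv[1][1] * diff[1]]
    c = 0.5 * (T - quad(mu0, sigma_inv, mu0) + quad(mu1, sigma_inv, mu1))
    return w, c

def predict_lda(w, c, x):
    """Class 1 if the projection of x onto w exceeds c, else class 0."""
    return 1 if w[0] * x[0] + w[1] * x[1] > c else 0
```

With T = 0 the decision boundary passes through the midpoint of the two class means, which corresponds to equal prior weight on the two classes.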

37.2 Canonical discriminant analysis for k classes

Canonical discriminant analysis (CDA) finds axes (k - 1
canonical coordinates, k being the number of classes) that
best separate the categories. These linear functions are
uncorrelated and define, in effect, an optimal k - 1 space
through the n-dimensional cloud of data that best separates
(the projections in that space of) the k groups. See
“Multiclass LDA” below for details.


37.3 Fisher’s linear discriminant 

The terms Fisher’s linear discriminant and LDA are often 
used interchangeably, although Fisher’s original article[1]
actually describes a slightly different discriminant, which 
does not make some of the assumptions of LDA such as 
normally distributed classes or equal class covariances. 

Suppose two classes of observations have means $\mu_0, \mu_1$
and covariances $\Sigma_0, \Sigma_1$. Then the linear combination of
features $w \cdot x$ will have means $w \cdot \mu_i$ and variances $w^T \Sigma_i w$
for i = 0, 1. Fisher defined the separation between these
two distributions to be the ratio of the variance between
the classes to the variance within the classes:


$$S = \frac{\sigma_{\text{between}}^2}{\sigma_{\text{within}}^2} = \frac{(w \cdot \mu_1 - w \cdot \mu_0)^2}{w^T \Sigma_1 w + w^T \Sigma_0 w} = \frac{(w \cdot (\mu_1 - \mu_0))^2}{w^T (\Sigma_0 + \Sigma_1) w}$$

This measure is, in some sense, a measure of the
signal-to-noise ratio for the class labelling. It can be shown that
the maximum separation occurs when

$$w \propto (\Sigma_0 + \Sigma_1)^{-1} (\mu_1 - \mu_0)$$

When the assumptions of LDA are satisfied, the above
equation is equivalent to LDA.

Be sure to note that the vector w is the normal to the
discriminant hyperplane. As an example, in a two-dimensional
problem, the line that best divides the two groups
is perpendicular to w.

Generally, the data points to be discriminated are projected
onto w; then the threshold that best separates the
data is chosen from analysis of the one-dimensional
distribution. There is no general rule for the threshold.
However, if projections of points from both classes exhibit
approximately the same distributions, a good choice would
be the hyperplane between projections of the two means,
$w \cdot \mu_0$ and $w \cdot \mu_1$. In this case the parameter c in threshold
condition $w \cdot x > c$ can be found explicitly:

$$c = w \cdot \frac{1}{2} (\mu_0 + \mu_1) = \frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 - \frac{1}{2} \mu_0^T \Sigma^{-1} \mu_0$$

Otsu’s Method is related to Fisher’s linear discriminant,
and was created to binarize the histogram of pixels in
a grayscale image by optimally picking the black/white
threshold that minimizes intra-class variance and maximizes
inter-class variance within/between grayscales assigned
to black and white pixel classes.


37.4 Multiclass LDA

In the case where there are more than two classes, the
analysis used in the derivation of the Fisher discriminant
can be extended to find a subspace which appears to contain
all of the class variability. This generalization is due
to C.R. Rao.[8] Suppose that each of C classes has a mean
$\mu_i$ and the same covariance $\Sigma$. Then the scatter between
class variability may be defined by the sample covariance
of the class means

$$\Sigma_b = \frac{1}{C} \sum_{i=1}^{C} (\mu_i - \mu)(\mu_i - \mu)^T$$

where $\mu$ is the mean of the class means. The class separation
in a direction w in this case will be given by

$$S = \frac{w^T \Sigma_b w}{w^T \Sigma w}$$

This means that when w is an eigenvector of $\Sigma^{-1} \Sigma_b$ the
separation will be equal to the corresponding eigenvalue.

If $\Sigma^{-1} \Sigma_b$ is diagonalizable, the variability between features
will be contained in the subspace spanned by the
eigenvectors corresponding to the C - 1 largest eigenvalues
(since $\Sigma_b$ is of rank C - 1 at most). These eigenvectors
are primarily used in feature reduction, as in PCA.
The eigenvectors corresponding to the smaller eigenvalues
will tend to be very sensitive to the exact choice of
training data, and it is often necessary to use regularisation
as described in the next section.
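
The multiclass separation measure can be explored with a small sketch. The helper names below are hypothetical, and no eigen-decomposition is performed: the code only builds the between-class scatter from the class means and evaluates the separation for a chosen direction w, which is enough to check the eigenvector property on a simple case.

```python
def between_class_scatter(class_means):
    """Sample covariance of the class means (the between-class scatter)."""
    C = len(class_means)
    d = len(class_means[0])
    mu = [sum(m[j] for m in class_means) / C for j in range(d)]
    return [[sum((m[i] - mu[i]) * (m[j] - mu[j]) for m in class_means) / C
             for j in range(d)] for i in range(d)]

def quadratic_form(w, m):
    """Compute w^T m w for a square matrix m given as nested lists."""
    n = len(w)
    return sum(w[i] * m[i][j] * w[j] for i in range(n) for j in range(n))

def separation(w, sigma_b, sigma):
    """Class separation S in direction w."""
    return quadratic_form(w, sigma_b) / quadratic_form(w, sigma)
```

As an illustrative case, take three class means (0, 0), (4, 0) and (2, 3) with the shared covariance assumed to be the identity. The between-class scatter is then diagonal with entries 8/3 and 2, so the coordinate axes are eigenvectors and the separation along each axis equals the corresponding eigenvalue.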

If classification is required, instead of dimension reduction,
there are a number of alternative techniques available.
For instance, the classes may be partitioned, and
a standard Fisher discriminant or LDA used to classify
each partition. A common example of this is “one against
the rest” where the points from one class are put in one
group, and everything else in the other, and then LDA
applied. This will result in C classifiers, whose results are
combined. Another common method is pairwise classification,
where a new classifier is created for each pair of
classes (giving C(C - 1)/2 classifiers in total), with the
individual classifiers combined to produce a final
classification.


37.5 Practical use 

In practice, the class means and covariances are not 
known. They can, however, be estimated from the train¬ 
ing set. Either the maximum likelihood estimate or the 
maximum a posteriori estimate may be used in place of 
the exact value in the above equations. Although the es¬ 
timates of the covariance may be considered optimal in 
some sense, this does not mean that the resulting discrim¬ 
inant obtained by substituting these values is optimal in 
any sense, even if the assumption of normally distributed 
classes is correct. 

Another complication in applying LDA and Fisher’s
discriminant to real data occurs when the number of
measurements of each sample exceeds the number of samples
in each class.[4] In this case, the covariance estimates
do not have full rank, and so cannot be inverted. There
are a number of ways to deal with this. One is to use a
pseudo inverse instead of the usual matrix inverse in the
above formulae. However, better numeric stability may
be achieved by first projecting the problem onto the
subspace spanned by $\Sigma_b$.[9] Another strategy to deal with
small sample size is to use a shrinkage estimator of the
covariance matrix, which can be expressed mathematically
as


$$\Sigma = (1 - \lambda) \Sigma + \lambda I$$

where I is the identity matrix, and $\lambda$ is the shrinkage
intensity or regularisation parameter. This leads to
the framework of regularized discriminant analysis[10] or
shrinkage discriminant analysis.[11]
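
The shrinkage estimator is simple to sketch. The function name `shrink_covariance` and the nested-list matrix representation are assumptions made for illustration; how the shrinkage intensity should be chosen is not covered here.

```python
def shrink_covariance(sigma, lam):
    """Return (1 - lam) * sigma + lam * I for a square matrix sigma."""
    n = len(sigma)
    return [[(1.0 - lam) * sigma[i][j] + (lam if i == j else 0.0)
             for j in range(n)] for i in range(n)]
```

For instance, the rank-deficient estimate [[1, 1], [1, 1]] cannot be inverted, but shrinking it with an intensity of 0.1 gives approximately [[1.0, 0.9], [0.9, 1.0]], whose determinant is positive.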

Also, in many practical cases linear discriminants are not 
suitable. LDA and Fisher’s discriminant can be extended 
for use in non-linear classification via the kernel trick. 
Here, the original observations are effectively mapped 
into a higher dimensional non-linear space. Linear classi¬ 
fication in this non-linear space is then equivalent to non¬ 
linear classification in the original space. The most com¬ 
monly used example of this is the kernel Fisher discrim¬ 
inant. 


LDA can be generalized to multiple discriminant analysis,
where c becomes a categorical variable with N possible
states, instead of only two. Analogously, if the class-conditional
densities $p(x \mid c = i)$ are normal with shared
covariances, the sufficient statistics for $P(c \mid x)$ are the
values of N projections, which are the subspace spanned
by the N means, affine projected by the inverse covariance
matrix. These projections can be found by solving
a generalized eigenvalue problem, where the numerator
is the covariance matrix formed by treating the means as
the samples, and the denominator is the shared covariance
matrix.


37.6 Applications 

In addition to the examples given below, LDA is applied 
in positioning and product management. 

37.6.1 Bankruptcy prediction 

In bankruptcy prediction based on accounting ratios and 
other financial variables, linear discriminant analysis was 
the first statistical method applied to systematically ex¬ 
plain which firms entered bankruptcy vs. survived. De¬ 
spite limitations including known nonconformance of ac¬ 
counting ratios to the normal distribution assumptions 
of LDA, Edward Altman's 1968 model is still a leading 
model in practical applications. 

37.6.2 Face recognition 

In computerised face recognition, each face is repre¬ 
sented by a large number of pixel values. Linear discrim¬ 
inant analysis is primarily used here to reduce the number 
of features to a more manageable number before classi¬ 
fication. Each of the new dimensions is a linear combi¬ 
nation of pixel values, which form a template. The linear 
combinations obtained using Fisher’s linear discriminant 
are called Fisher faces, while those obtained using the re¬ 
lated principal component analysis are called eigenfaces. 

37.6.3 Marketing 

In marketing, discriminant analysis was once often used 
to determine the factors which distinguish different types 
of customers and/or products on the basis of surveys or 
other forms of collected data. Logistic regression or other 
methods are now more commonly used. The use of dis¬ 
criminant analysis in marketing can be described by the 
following steps: 

1. Formulate the problem and gather data — Identify 
the salient attributes consumers use to evaluate prod¬ 
ucts in this category — Use quantitative market¬ 
ing research techniques (such as surveys) to collect 
data from a sample of potential customers concerning
their ratings of all the product attributes. The
data collection stage is usually done by marketing 
research professionals. Survey questions ask the re¬ 
spondent to rate a product from one to five (or 1 to 
7, or 1 to 10) on a range of attributes chosen by 
the researcher. Anywhere from five to twenty at¬ 
tributes are chosen. They could include things like: 
ease of use, weight, accuracy, durability, colourful¬ 
ness, price, or size. The attributes chosen will vary 
depending on the product being studied. The same 
question is asked about all the products in the study. 
The data for multiple products is codified and input 
into a statistical program such as R, SPSS or SAS. 
(This step is the same as in Factor analysis). 

2. Estimate the Discriminant Function Coefficients 
and determine the statistical significance and valid¬ 
ity — Choose the appropriate discriminant analy¬ 
sis method. The direct method involves estimating 
the discriminant function so that all the predictors 
are assessed simultaneously. The stepwise method 
enters the predictors sequentially. The two-group 
method should be used when the dependent variable 
has two categories or states. The multiple discrim¬ 
inant method is used when the dependent variable 
has three or more categorical states. Use Wilks’s 
Lambda to test for significance in SPSS or F stat in 
SAS. The most common method used to test valid¬ 
ity is to split the sample into an estimation or analy¬ 
sis sample, and a validation or holdout sample. The 
estimation sample is used in constructing the dis¬ 
criminant function. The validation sample is used to 
construct a classification matrix which contains the 
number of correctly classified and incorrectly clas¬ 
sified cases. The percentage of correctly classified 
cases is called the hit ratio. 

3. Plot the results on a two dimensional map, define 
the dimensions, and interpret the results. The sta¬ 
tistical program (or a related module) will map the 
results. The map will plot each product (usually in 
two-dimensional space). The distance between products
indicates how different they are.
The dimensions must be labelled by the researcher. 
This requires subjective judgement and is often very 
challenging. See perceptual mapping. 

37.6.4 Biomedical studies 

The main application of discriminant analysis in 
medicine is the assessment of severity state of a patient 
and prognosis of disease outcome. For example, during 
retrospective analysis, patients are divided into groups ac¬ 
cording to severity of disease - mild, moderate and severe 
form. Then results of clinical and laboratory analyses are 
studied in order to reveal variables which are statistically 
different in studied groups. Using these variables,
discriminant functions are built which help to objectively
classify disease in a future patient into mild, moderate 
or severe form. 

In biology, similar principles are used in order to clas¬ 
sify and define groups of different biological objects, for 
example, to define phage types of Salmonella enteritidis 
based on Fourier transform infrared spectra,[12] to detect
animal source of Escherichia coli studying its virulence
factors,[13] etc.

37.6.5 Earth Science 

This method can be used to separate the alteration zones. 
For example, when different data from various zones are 
available, discriminant analysis can find the pattern within
the data and classify them effectively.[14]

37.7 See also 

• Data mining 

• Decision tree learning 

• Factor analysis 

• Kernel Fisher discriminant analysis 

• Logit (for logistic regression) 

• Multidimensional scaling 

• Multilinear subspace learning 

• Pattern recognition 

• Perceptron 

• Preference regression 

• Quadratic classifier 

Chapter 38

Naive Bayes classifier 


In machine learning, naive Bayes classifiers are a fam¬ 
ily of simple probabilistic classifiers based on apply¬ 
ing Bayes’ theorem with strong (naive) independence as¬ 
sumptions between the features. 

Naive Bayes has been studied extensively since the 1950s. 
It was introduced under a different name into the text
retrieval community in the early 1960s,[1]:488 and remains
a popular (baseline) method for text categorization, the 
problem of judging documents as belonging to one cat¬ 
egory or the other (such as spam or legitimate, sports 
or politics, etc.) with word frequencies as the features. 
With appropriate preprocessing, it is competitive in this 
domain with more advanced methods including support 
vector machines.[2] It also finds application in automatic
medical diagnosis.[3]

Naive Bayes classifiers are highly scalable, requiring a 
number of parameters linear in the number of variables 
(features/predictors) in a learning problem. Maximum-likelihood
training can be done by evaluating a closed-form
expression,[1]:718 which takes linear time, rather
than by expensive iterative approximation as used for 
many other types of classifiers. 

In the statistics and computer science literature, Naive 
Bayes models are known under a variety of names, including
simple Bayes and independence Bayes.[4] All
these names reference the use of Bayes’ theorem in the
classifier’s decision rule, but naive Bayes is not (necessarily)
a Bayesian method;[4] Russell and Norvig note that
“[naive Bayes] is sometimes called a Bayesian classifier,
a somewhat careless usage that has prompted true
Bayesians to call it the idiot Bayes model.”[1]:482


38.1 Introduction 

Naive Bayes is a simple technique for constructing classifiers:
models that assign class labels to problem instances,
represented as vectors of feature values, where the class
labels are drawn from some finite set. It is not a single
algorithm for training such classifiers, but a family of
algorithms based on a common principle: all naive Bayes
classifiers assume that the value of a particular feature
is independent of the value of any other feature, given
the class variable. For example, a fruit may be considered
to be an apple if it is red, round, and about 3 cm
in diameter. A naive Bayes classifier considers each of 
these features to contribute independently to the proba¬ 
bility that this fruit is an apple, regardless of any possible 
correlations between the color, roundness and diameter 
features. 

For some types of probability models, naive Bayes classi¬ 
fiers can be trained very efficiently in a supervised learn¬ 
ing setting. In many practical applications, parameter 
estimation for naive Bayes models uses the method of 
maximum likelihood; in other words, one can work with 
the naive Bayes model without accepting Bayesian prob¬ 
ability or using any Bayesian methods. 

Despite their naive design and apparently oversimplified 
assumptions, naive Bayes classifiers have worked quite 
well in many complex real-world situations. In 2004, an 
analysis of the Bayesian classification problem showed 
that there are sound theoretical reasons for the apparently 
implausible efficacy of naive Bayes classifiers.[5] Still, a
comprehensive comparison with other classification al¬ 
gorithms in 2006 showed that Bayes classification is out¬ 
performed by other approaches, such as boosted trees or 
random forests.[6]

An advantage of naive Bayes is that it only requires a small 
amount of training data to estimate the parameters nec¬ 
essary for classification. 

38.2 Probabilistic model 

Abstractly, naive Bayes is a conditional probability
model: given a problem instance to be classified, represented
by a vector $x = (x_1, \dots, x_n)$ representing some n
features (independent variables), it assigns to this instance
probabilities


$$p(C_k \mid x_1, \dots, x_n)$$

for each of K possible outcomes or classes.[7]

The problem with the above formulation is that if the
number of features n is large or if a feature can take on
a large number of values, then basing such a model on
probability tables is infeasible. We therefore reformulate
the model to make it more tractable. Using Bayes’ theorem,
the conditional probability can be decomposed as

$$p(C_k \mid x) = \frac{p(C_k) \, p(x \mid C_k)}{p(x)}$$

In plain English, using Bayesian probability terminology,
the above equation can be written as

$$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}$$

In practice, there is interest only in the numerator of that
fraction, because the denominator does not depend on C
and the values of the features $F_i$ are given, so that the
denominator is effectively constant. The numerator is
equivalent to the joint probability model

$$p(C_k, x_1, \dots, x_n)$$

which can be rewritten as follows, using the chain rule
for repeated applications of the definition of conditional
probability:

$$p(C_k, x_1, \dots, x_n) = p(C_k) \, p(x_1, \dots, x_n \mid C_k)$$
$$= p(C_k) \, p(x_1 \mid C_k) \, p(x_2, \dots, x_n \mid C_k, x_1)$$
$$= p(C_k) \, p(x_1 \mid C_k) \, p(x_2 \mid C_k, x_1) \, p(x_3, \dots, x_n \mid C_k, x_1, x_2)$$
$$= p(C_k) \, p(x_1 \mid C_k) \, p(x_2 \mid C_k, x_1) \cdots p(x_n \mid C_k, x_1, \dots, x_{n-1})$$

Now the “naive” conditional independence assumptions
come into play: assume that each feature $F_i$ is conditionally
independent of every other feature $F_j$ for $j \neq i$,
given the category C. This means that

$$p(x_i \mid C_k, x_j) = p(x_i \mid C_k)$$
$$p(x_i \mid C_k, x_j, x_k) = p(x_i \mid C_k)$$
$$p(x_i \mid C_k, x_j, x_k, x_l) = p(x_i \mid C_k)$$

and so on, for $i \neq j, k, l$. Thus, the joint model can be
expressed as

$$p(C_k \mid x_1, \dots, x_n) \propto p(C_k, x_1, \dots, x_n)$$
$$\propto p(C_k) \, p(x_1 \mid C_k) \, p(x_2 \mid C_k) \, p(x_3 \mid C_k) \cdots$$
$$\propto p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)$$

This means that under the above independence assumptions,
the conditional distribution over the class variable
C is:

$$p(C_k \mid x_1, \dots, x_n) = \frac{1}{Z} \, p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)$$

where the evidence Z = p(x) is a scaling factor dependent
only on $x_1, \dots, x_n$, that is, a constant if the values
of the feature variables are known.


38.2.1 Constructing a classifier from the probability model

The discussion so far has derived the independent feature
model, that is, the naive Bayes probability model. The
naive Bayes classifier combines this model with a decision
rule. One common rule is to pick the hypothesis that is
most probable; this is known as the maximum a posteriori
or MAP decision rule. The corresponding classifier, a
Bayes classifier, is the function that assigns a class label
$\hat{y} = C_k$ for some k as follows:

$$\hat{y} = \underset{k \in \{1, \dots, K\}}{\operatorname{argmax}} \; p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)$$


38.3 Parameter estimation and event models

A class’s prior may be calculated by assuming equiprobable
classes (i.e., priors = 1 / (number of classes)), or by
calculating an estimate for the class probability from the
training set (i.e., (prior for a given class) = (number of
samples in the class) / (total number of samples)). To
estimate the parameters for a feature’s distribution, one must
assume a distribution or generate nonparametric models
for the features from the training set.[8]

The assumptions on distributions of features are called
the event model of the Naive Bayes classifier. For discrete
features like the ones encountered in document classification
(including spam filtering), multinomial and Bernoulli
distributions are popular. These assumptions lead to two
distinct models, which are often confused.[9][10]

38.3.1 Gaussian naive Bayes


When dealing with continuous data, a typical assumption
is that the continuous values associated with each class
are distributed according to a Gaussian distribution. For
example, suppose the training data contain a continuous
attribute, x. We first segment the data by the class, and
then compute the mean and variance of x in each class.
Let $\mu_c$ be the mean of the values in x associated with
class c, and let $\sigma_c^2$ be the variance of the values in x
associated with class c. Then, the probability distribution of
some value given a class, $p(x = v \mid c)$, can be computed
by plugging v into the equation for a Normal distribution
parameterized by $\mu_c$ and $\sigma_c^2$. That is,


$$p(x = v \mid c) = \frac{1}{\sqrt{2 \pi \sigma_c^2}} \, e^{-\frac{(v - \mu_c)^2}{2 \sigma_c^2}}$$
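
Combining this Gaussian density with class priors estimated from class frequencies and the maximum a posteriori decision rule gives a complete classifier. The sketch below is illustrative only: the function names are hypothetical, the variance uses the unbiased n - 1 denominator (an assumption, so each class needs at least two samples with non-zero spread), and the scores are computed in log space to avoid numerical underflow.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(samples, labels):
    """Estimate the prior and per-feature (mean, variance) for each class."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for c, rows in by_class.items():
        prior = len(rows) / len(samples)  # class frequency in training set
        stats = []
        for column in zip(*rows):  # one column per feature
            mu = sum(column) / len(column)
            var = sum((v - mu) ** 2 for v in column) / (len(column) - 1)
            stats.append((mu, var))
        model[c] = (prior, stats)
    return model

def log_gaussian(v, mu, var):
    """Log of the Normal density with mean mu and variance var at v."""
    return -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)

def predict_gaussian_nb(model, x):
    """Pick the class maximising log prior plus summed log likelihoods."""
    def log_joint(c):
        prior, stats = model[c]
        return math.log(prior) + sum(
            log_gaussian(v, mu, var) for v, (mu, var) in zip(x, stats))
    return max(model, key=log_joint)
```

Summing log densities is equivalent to multiplying the per-feature likelihoods, so the predicted class is the same as under the product form of the decision rule.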


Another common technique for handling continuous val¬ 
ues is to use binning to discretize the feature values, to 
obtain a new set of Bernoulli-distributed features; some 
literature in fact suggests that this is necessary to apply 
naive Bayes, but it is not, and the discretization may throw 
away discriminative information.[4]


38.3.2 Multinomial naive Bayes 

With a multinomial event model, samples (feature vectors)
represent the frequencies with which certain events
have been generated by a multinomial $(p_1, \dots, p_n)$ where
$p_i$ is the probability that event i occurs (or K such
multinomials in the multiclass case). A feature vector
$x = (x_1, \dots, x_n)$ is then a histogram, with $x_i$ counting the
number of times event i was observed in a particular
instance. This is the event model typically used for
document classification, with events representing the
occurrence of a word in a single document (see bag of words
assumption). The likelihood of observing a histogram x
is given by

$$p(x \mid C_k) = \frac{(\sum_i x_i)!}{\prod_i x_i!} \prod_i p_{ki}^{x_i}$$

The multinomial naive Bayes classifier becomes a linear
classifier when expressed in log-space:[2]

$$\log p(C_k \mid x) \propto \log \left( p(C_k) \prod_{i=1}^{n} p_{ki}^{x_i} \right)$$
$$= \log p(C_k) + \sum_{i=1}^{n} x_i \cdot \log p_{ki}$$
$$= b + w_k^T x$$

where $b = \log p(C_k)$ and $w_{ki} = \log p_{ki}$.

If a given class and feature value never occur together
in the training data, then the frequency-based probability
estimate will be zero. This is problematic because it will
wipe out all information in the other probabilities when
they are multiplied. Therefore, it is often desirable to
incorporate a small-sample correction, called pseudocount,
in all probability estimates such that no probability is ever
set to be exactly zero. This way of regularizing naive
Bayes is called Laplace smoothing when the pseudocount
is one, and Lidstone smoothing in the general case.

Rennie et al. discuss problems with the multinomial
assumption in the context of document classification and
possible ways to alleviate those problems, including the
use of tf-idf weights instead of raw term frequencies
and document length normalization, to produce a naive
Bayes classifier that is competitive with support vector
machines.[2]

38.3.3 Bernoulli naive Bayes

In the multivariate Bernoulli event model, features are
independent booleans (binary variables) describing inputs.
Like the multinomial model, this model is popular for
document classification tasks,[9] where binary term
occurrence features are used rather than term frequencies.
If $x_i$ is a boolean expressing the occurrence or absence
of the i'th term from the vocabulary, then the likelihood
of a document given a class $C_k$ is given by[9]

$$p(x \mid C_k) = \prod_{i=1}^{n} p_{ki}^{x_i} (1 - p_{ki})^{(1 - x_i)}$$

where $p_{ki}$ is the probability of class $C_k$ generating the
term $w_i$. This event model is especially popular for
classifying short texts. It has the benefit of explicitly
modelling the absence of terms. Note that a naive Bayes
classifier with a Bernoulli event model is not the same as
a multinomial NB classifier with frequency counts
truncated to one.
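
The multinomial event model with pseudocount smoothing can be sketched as a linear classifier in log space. The function names and the exact smoothing formula (adding alpha to each event count and alpha times the number of features to the class total) are assumptions chosen for illustration; alpha = 1 corresponds to Laplace smoothing and other values to Lidstone smoothing.

```python
import math

def fit_multinomial_nb(histograms, labels, alpha=1.0):
    """Return {class: (b, w)} with b = log prior and w[i] = log p_ki."""
    n_features = len(histograms[0])
    model = {}
    for c in sorted(set(labels)):
        rows = [h for h, y in zip(histograms, labels) if y == c]
        counts = [sum(column) for column in zip(*rows)]  # events per feature
        total = sum(counts)
        b = math.log(len(rows) / len(labels))
        # The pseudocount alpha keeps every probability strictly positive.
        w = [math.log((n + alpha) / (total + alpha * n_features))
             for n in counts]
        model[c] = (b, w)
    return model

def predict_multinomial_nb(model, x):
    """Pick the class with the largest linear score b + w . x."""
    def score(c):
        b, w = model[c]
        return b + sum(wi * xi for wi, xi in zip(w, x))
    return max(model, key=score)
```

The multinomial coefficient is omitted from the score because it is the same for every class and therefore does not affect which class attains the maximum.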


38.3.4 Semi-supervised parameter estimation

Given a way to train a naive Bayes classifier from labeled data, it’s possible to construct a semi-supervised training algorithm that can learn from a combination of labeled and unlabeled data by running the supervised learning algorithm in a loop:[11]

    Given a collection $D = L \uplus U$ of labeled samples L and unlabeled samples U, start by training a naive Bayes classifier on L.

    Until convergence, do:

        Predict class probabilities $P(C \mid x)$ for all examples x in D.

        Re-train the model based on the probabilities (not the labels) predicted in the previous step.

Convergence is determined based on improvement to the model likelihood $P(D \mid \theta)$, where $\theta$ denotes the parameters of the naive Bayes model.

This training algorithm is an instance of the more general expectation-maximization algorithm (EM): the prediction step inside the loop is the E-step of EM, while the re-training of naive Bayes is the M-step. The algorithm is formally justified by the assumption that the data are generated by a mixture model, and the components of this mixture model are exactly the classes of the classification problem.[11]
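The loop described above can be sketched with a small Bernoulli naive Bayes model. This is a minimal illustration under stated assumptions, not a reference implementation: all function names, the smoothing constant, the stopping tolerance and the toy data are hypothetical, and every class is assumed to appear at least once in the labeled set so that no prior is zero.

```python
import math

def fit(X, R, alpha=1.0):
    """M-step: fit Bernoulli naive Bayes from soft class probabilities R."""
    n_classes, n_features = len(R[0]), len(X[0])
    priors, probs = [], []
    for k in range(n_classes):
        weight = sum(r[k] for r in R)
        priors.append(weight / len(X))
        # Smoothed estimate of p_ki from probability-weighted counts.
        probs.append([
            (sum(r[k] * x[i] for x, r in zip(X, R)) + alpha) / (weight + 2 * alpha)
            for i in range(n_features)
        ])
    return priors, probs

def joint_log_probs(x, priors, probs):
    """Return log p(C_k, x) for every class k."""
    return [
        math.log(prior) + sum(math.log(p if x_i else 1.0 - p)
                              for x_i, p in zip(x, p_k))
        for prior, p_k in zip(priors, probs)
    ]

def predict_proba(x, priors, probs):
    """E-step for one example: the posterior P(C | x)."""
    logs = joint_log_probs(x, priors, probs)
    m = max(logs)
    weights = [math.exp(v - m) for v in logs]
    z = sum(weights)
    return [w / z for w in weights]

def log_likelihood(X, priors, probs):
    """Return log P(D | theta), summing over the classes for each example."""
    total = 0.0
    for x in X:
        logs = joint_log_probs(x, priors, probs)
        m = max(logs)
        total += m + math.log(sum(math.exp(v - m) for v in logs))
    return total

def semi_supervised_nb(X_lab, y_lab, X_unl, n_classes, tol=1e-6, max_iter=100):
    one_hot = [[1.0 if k == y else 0.0 for k in range(n_classes)] for y in y_lab]
    priors, probs = fit(X_lab, one_hot)      # start from the labeled set L only
    X_all = X_lab + X_unl                    # D = L plus U
    previous = log_likelihood(X_all, priors, probs)
    for _ in range(max_iter):
        R = [predict_proba(x, priors, probs) for x in X_all]  # E-step on D
        priors, probs = fit(X_all, R)                          # M-step
        current = log_likelihood(X_all, priors, probs)
        # With smoothing the likelihood need not rise monotonically,
        # so stop on a small absolute change.
        if abs(current - previous) < tol:
            break
        previous = current
    return priors, probs

# Hypothetical toy data: three binary features, two classes.
X_labeled = [[1, 1, 0], [0, 0, 1]]
y_labeled = [0, 1]
X_unlabeled = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]

priors, probs = semi_supervised_nb(X_labeled, y_labeled, X_unlabeled, n_classes=2)
posterior = predict_proba([1, 1, 0], priors, probs)
print(priors, posterior)
```

Note that the re-training step uses the predicted probabilities as fractional counts rather than hard labels, which is what makes the procedure an instance of EM.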

38.4 Discussion

Despite the fact that the far-reaching independence assumptions are often inaccurate, the naive Bayes classifier has several properties that make it surprisingly useful in practice. In particular, the decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution. This helps alleviate problems stemming from the curse of dimensionality, such as the need for data sets that scale exponentially with the number of features. While naive Bayes often fails to produce a good estimate for the correct class probabilities,[12] this may not be a requirement for many applications. For example, the naive Bayes classifier will make the correct MAP decision rule classification so long as the correct class is more probable than any other class. This is true regardless of whether the probability estimate is slightly, or even grossly inaccurate. In this manner, the overall classifier can be robust enough to ignore serious deficiencies in its underlying naive probability model.[3] Other reasons for the observed success of the naive Bayes classifier are discussed in the literature cited below.

38.4.1 Relation to logistic regression

In the case of discrete inputs (indicator or frequency features for discrete events), naive Bayes classifiers form a generative-discriminative pair with (multinomial) logistic regression classifiers: each naive Bayes classifier can be considered a way of fitting a probability model that optimizes the joint likelihood $p(C, \mathbf{x})$, while logistic regression fits the same probability model to optimize the conditional $p(C \mid \mathbf{x})$.[13]

The link between the two can be seen by observing that the decision function for naive Bayes (in the binary case) can be rewritten as "predict class $C_1$ if the odds of $p(C_1 \mid \mathbf{x})$ exceed those of $p(C_2 \mid \mathbf{x})$". Expressing this in log-space gives:

$$\log \frac{p(C_1 \mid \mathbf{x})}{p(C_2 \mid \mathbf{x})} = \log p(C_1 \mid \mathbf{x}) - \log p(C_2 \mid \mathbf{x}) > 0$$

The left-hand side of this equation is the log-odds, or logit, the quantity predicted by the linear model that underlies logistic regression. Since naive Bayes is also a linear model for the two “discrete” event models, it can be reparametrised as a linear function $b + \mathbf{w}^\top \mathbf{x} > 0$. Obtaining the probabilities is then a matter of applying the logistic function to $b + \mathbf{w}^\top \mathbf{x}$, or in the multiclass case, the softmax function.

Discriminative classifiers have lower asymptotic error than generative ones; however, research by Ng and Jordan has shown that in some practical cases naive Bayes can outperform logistic regression because it reaches its asymptotic error faster.[13]

38.5 Examples

38.5.1 Sex classification

Problem: classify whether a given person is a male or a female based on the measured features. The features include height, weight, and foot size.

Training

Example training set below.

The classifier created from the training set using a Gaussian distribution assumption would be (given variances are unbiased sample variances):

Let’s say we have equiprobable classes so P(male) = P(female) = 0.5. This prior probability distribution might be based on our knowledge of frequencies in the larger population, or on frequency in the training set.

Testing

Below is a sample to be classified as a male or female.

We wish to determine which posterior is greater, male or female. For the classification as male the posterior is given by

$$\text{posterior(male)} = \frac{P(\text{male}) \, p(\text{height} \mid \text{male}) \, p(\text{weight} \mid \text{male}) \, p(\text{foot size} \mid \text{male})}{\text{evidence}}$$

For the classification as female the posterior is given by

$$\text{posterior(female)} = \frac{P(\text{female}) \, p(\text{height} \mid \text{female}) \, p(\text{weight} \mid \text{female}) \, p(\text{foot size} \mid \text{female})}{\text{evidence}}$$

The evidence (also termed normalizing constant) may be calculated:

$$\begin{aligned}
\text{evidence} = {} & P(\text{male}) \, p(\text{height} \mid \text{male}) \, p(\text{weight} \mid \text{male}) \, p(\text{foot size} \mid \text{male}) \\
& + P(\text{female}) \, p(\text{height} \mid \text{female}) \, p(\text{weight} \mid \text{female}) \, p(\text{foot size} \mid \text{female})
\end{aligned}$$

However, given the sample the evidence is a constant and thus scales both posteriors equally. It therefore does not affect classification and can be ignored. We now determine the probability distribution for the sex of the sample.
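The arithmetic of this example can be checked with a short sketch. It uses the mean and variance for height given male and the likelihood values quoted in this example; the sample height of 6 is an assumption, since the sample table is not reproduced here, and all identifier names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

# Parameters quoted in the text for p(height | male).
MU_HEIGHT_MALE = 5.855
VAR_HEIGHT_MALE = 3.5033e-2

# Assumption: the sample to classify has height 6.
SAMPLE_HEIGHT = 6.0

p_height_male = gaussian_pdf(SAMPLE_HEIGHT, MU_HEIGHT_MALE, VAR_HEIGHT_MALE)

# Posterior numerators: the prior times the three likelihoods quoted in the text.
numerator_male = 0.5 * p_height_male * 5.9881e-6 * 1.3112e-3
numerator_female = 0.5 * 2.2346e-1 * 1.6789e-2 * 2.8669e-1

# The evidence is the same for both classes, so comparing numerators suffices.
prediction = "male" if numerator_male > numerator_female else "female"
print(p_height_male, numerator_male, numerator_female, prediction)
```

The density value exceeds 1, which is acceptable because it is a probability density rather than a probability.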


P(male) = 0.5

$$p(\text{height} \mid \text{male}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( \frac{-(h - \mu)^2}{2 \sigma^2} \right) \approx 1.5789$$

where h is the height of the sample, and $\mu = 5.855$ and $\sigma^2 = 3.5033 \cdot 10^{-2}$ are the parameters of normal distribution which have been previously determined from the training set. Note that a value greater than 1 is OK here - it is a probability density rather than a probability, because height is a continuous variable.

p(weight | male) = 5.9881 · 10^-6
p(foot size | male) = 1.3112 · 10^-3
posterior numerator (male) = their product = 6.1984 · 10^-9

P(female) = 0.5
p(height | female) = 2.2346 · 10^-1
p(weight | female) = 1.6789 · 10^-2
p(foot size | female) = 2.8669 · 10^-1
posterior numerator (female) = their product = 5.3778 · 10^-4

Since posterior numerator is greater in the female case, we predict the sample is female.

38.5.2 Document classification

Here is a worked example of naive Bayesian classification to the document classification problem. Consider the problem of classifying documents by their content, for example into spam and non-spam e-mails. Imagine that documents are drawn from a number of classes of documents which can be modelled as sets of words where the (independent) probability that the i-th word of a given document occurs in a document from class C can be written as

$$p(w_i \mid C)$$

(For this treatment, we simplify things further by assuming that words are randomly distributed in the document - that is, words are not dependent on the length of the document, position within the document with relation to other words, or other document-context.)

Then the probability that a given document D contains all of the words $w_i$, given a class C, is

$$p(D \mid C) = \prod_i p(w_i \mid C)$$

The question that we desire to answer is: “what is the probability that a given document D belongs to a given class C?” In other words, what is $p(C \mid D)$?

Now by definition

$$p(D \mid C) = \frac{p(D \cap C)}{p(C)}$$

and

$$p(C \mid D) = \frac{p(D \cap C)}{p(D)}$$

Bayes’ theorem manipulates these into a statement of probability in terms of likelihood.

$$p(C \mid D) = \frac{p(C)}{p(D)} \, p(D \mid C)$$

Assume for the moment that there are only two mutually exclusive classes, S and ¬S (e.g. spam and not spam), such that every element (email) is in either one or the other;

$$p(D \mid S) = \prod_i p(w_i \mid S)$$

and

$$p(D \mid \neg S) = \prod_i p(w_i \mid \neg S)$$

Using the Bayesian result above, we can write:

$$p(S \mid D) = \frac{p(S)}{p(D)} \prod_i p(w_i \mid S)$$

$$p(\neg S \mid D) = \frac{p(\neg S)}{p(D)} \prod_i p(w_i \mid \neg S)$$

Dividing one by the other gives:

$$\frac{p(S \mid D)}{p(\neg S \mid D)} = \frac{p(S) \prod_i p(w_i \mid S)}{p(\neg S) \prod_i p(w_i \mid \neg S)}$$

Which can be re-factored as:

$$\frac{p(S \mid D)}{p(\neg S \mid D)} = \frac{p(S)}{p(\neg S)} \prod_i \frac{p(w_i \mid S)}{p(w_i \mid \neg S)}$$

Thus, the probability ratio p(S | D) / p(¬S | D) can be expressed in terms of a series of likelihood ratios. The actual probability p(S | D) can be easily computed from log (p(S | D) / p(¬S | D)) based on the observation that p(S | D) + p(¬S | D) = 1.

Taking the logarithm of all these ratios, we have:

$$\ln \frac{p(S \mid D)}{p(\neg S \mid D)} = \ln \frac{p(S)}{p(\neg S)} + \sum_i \ln \frac{p(w_i \mid S)}{p(w_i \mid \neg S)}$$

(This technique of "log-likelihood ratios" is a common technique in statistics. In the case of two mutually exclusive alternatives (such as this example), the conversion of a log-likelihood ratio to a probability takes the form of a sigmoid curve: see logit for details.)

Finally, the document can be classified as follows. It is spam if $p(S \mid D) > p(\neg S \mid D)$ (i.e., $\ln \frac{p(S \mid D)}{p(\neg S \mid D)} > 0$), otherwise it is not spam.
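The log-likelihood-ratio rule above can be sketched as follows. The per-word probabilities, the prior and the function names are hypothetical values chosen for illustration, and every word in the document is assumed to have a non-zero probability under both classes.

```python
import math

def spam_log_odds(words, p_word_spam, p_word_ham, p_spam=0.5):
    """Return ln(p(S|D) / p(not S|D)) for a document given as a list of words."""
    log_odds = math.log(p_spam / (1.0 - p_spam))  # prior log-odds
    for w in words:
        # Each word adds its own log-likelihood ratio.
        log_odds += math.log(p_word_spam[w] / p_word_ham[w])
    return log_odds

def spam_probability(log_odds):
    """Convert log-odds to p(S|D), using p(S|D) + p(not S|D) = 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical per-word likelihoods p(w | S) and p(w | not S).
P_WORD_SPAM = {"offer": 0.30, "meeting": 0.02, "free": 0.25}
P_WORD_HAM = {"offer": 0.05, "meeting": 0.20, "free": 0.05}

doc = ["free", "offer"]
log_odds = spam_log_odds(doc, P_WORD_SPAM, P_WORD_HAM)
is_spam = log_odds > 0
p_spam_given_doc = spam_probability(log_odds)
print(log_odds, is_spam, p_spam_given_doc)
```

The conversion in `spam_probability` is the sigmoid (logistic) function mentioned in the text.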


38.6 See also 

• AODE 

• Bayesian spam filtering 

• Bayesian network 

• Random naive Bayes 

• Linear classifier 

• Logistic regression 

• Perceptron 

• Take-the-best heuristic

• Benchmark results of Naive Bayes implementations

• Hierarchical Naive Bayes Classifiers for uncertain data (an extension of the Naive Bayes classifier).

Software

• Naive Bayes classifiers are available in many general-purpose machine learning and NLP packages, including Apache Mahout, Mallet, NLTK, Orange, scikit-learn and Weka.

• IMSL Numerical Libraries: collections of math and statistical algorithms available in C/C++, Fortran, Java and C#/.NET. Data mining routines in the IMSL Libraries include a Naive Bayes classifier.

• Winnow content recommendation: open source Naive Bayes text classifier that works with very small training and unbalanced training sets. High performance, C, any Unix.

• An interactive Microsoft Excel spreadsheet Naive Bayes implementation using VBA (requires enabled macros) with viewable source code.

• jBNC - Bayesian Network Classifier Toolbox

• Statistical Pattern Recognition Toolbox for Matlab.

• ifile - the first freely available (Naive) Bayesian mail/spam filter

• NClassifier - NClassifier is a .NET library that supports text classification and text summarization. It is a port of Classifier4J.

• Classifier4J - Classifier4J is a Java library designed to do text classification. It comes with an implementation of a Bayesian classifier.


Chapter 39 

Cross-validation (statistics) 


Cross-validation, sometimes called rotation estimation,[1][2] is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first seen data) against which the model is tested (testing dataset).[4] The goal of cross-validation is to define a dataset to “test” the model in the training phase (i.e., the validation dataset), in order to limit problems like overfitting and to give an insight into how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem).

One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.

Cross-validation is important in guarding against testing hypotheses suggested by the data (called "Type III errors"[5]), especially where further samples are hazardous, costly or impossible to collect.

Furthermore, one of the main reasons for using cross-validation instead of conventional validation (e.g. partitioning the data set into two sets of 70% for training and 30% for test) is that the error (e.g. root mean square error) on the training set in conventional validation is not a useful estimator of model performance, and thus the error on the test data set does not properly represent the assessment of model performance. This may be because there is not enough data available, or because the data are not distributed and spread well enough to be partitioned into separate training and test sets. In these cases, a fair way to estimate model prediction performance is to use cross-validation as a powerful general technique.[6]

In summary, cross-validation combines (averages) measures of fit (prediction error) to correct for the optimistic nature of training error and derive a more accurate estimate of model prediction performance.[6]

39.1 Purpose of cross-validation 

Suppose we have a model with one or more unknown 
parameters, and a data set to which the model can be 
fit (the training data set). The fitting process optimizes 
the model parameters to make the model fit the training 
data as well as possible. If we then take an independent 
sample of validation data from the same population as 
the training data, it will generally turn out that the model 
does not fit the validation data as well as it fits the training 
data. This is called overfitting, and is particularly likely 
to happen when the size of the training data set is small, 
or when the number of parameters in the model is large. 
Cross-validation is a way to predict the fit of a model to a 
hypothetical validation set when an explicit validation set 
is not available. 

Linear regression provides a simple illustration of overfitting. In linear regression we have real response values $y_1, \ldots, y_n$, and n p-dimensional vector covariates $x_1, \ldots, x_n$. The components of the vectors $x_i$ are denoted $x_{i1}, \ldots, x_{ip}$. If we use least squares to fit a function in the form of a hyperplane $y = a + \beta^\top x$ to the data $(x_i, y_i)_{1 \le i \le n}$, we could then assess the fit using the mean squared error (MSE). The MSE for a given value of the parameters a and β on the training set $(x_i, y_i)_{1 \le i \le n}$ is

$$\frac{1}{n} \sum_{i=1}^{n} (y_i - a - \beta^\top x_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - a - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2$$

It can be shown under mild assumptions that the expected value of the MSE for the training set is (n - p - 1)/(n + p + 1) < 1 times the expected value of the MSE for the validation set (the expected value is taken over the distribution of training sets). Thus if we fit the model and compute the MSE on the training set, we will get an optimistically biased assessment of how well the model will fit an independent data set. This biased estimate is called the in-sample estimate of the fit, whereas the cross-validation estimate is an out-of-sample estimate.

Since in linear regression it is possible to directly compute the factor (n - p - 1)/(n + p + 1) by which the training MSE underestimates the validation MSE, cross-validation is not practically useful in that setting (however, cross-validation remains useful in the context of linear regression in that it can be used to select an optimally regularized cost function). In most other regression procedures (e.g. logistic regression), there is no simple formula to make such an adjustment. Cross-validation is, thus, a generally applicable way to predict the performance of a model on a validation set using computation in place of mathematical analysis.

39.2 Common types of cross-validation

Two types of cross-validation can be distinguished, exhaustive and non-exhaustive cross-validation.


39.2.1 Exhaustive cross-validation 

Exhaustive cross-validation methods are cross-validation methods which learn and test on all possible ways to divide the original sample into a training and a validation set.


Leave-p-out cross-validation 

Leave-p-out cross-validation (LpO CV) involves using p observations as the validation set and the remaining observations as the training set. This is repeated for every way of splitting the original sample into a validation set of p observations and a training set.

LpO cross-validation requires learning and validating $\binom{n}{p}$ times (where n is the number of observations in the original sample), so it becomes computationally infeasible as soon as n is moderately large. (See Binomial coefficient.)


Leave-one-out cross-validation 

Leave-one-out cross-validation (LOOCV) is a particular case of leave-p-out cross-validation with p = 1.

LOO cross-validation doesn't have the calculation problem of general LpO cross-validation because $\binom{n}{1} = n$.
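The growth in the number of splits can be seen by enumerating them directly. The sketch below is illustrative only; the function name is hypothetical, and it relies on `itertools.combinations` and `math.comb` from the Python standard library (Python 3.8 or later).

```python
import math
from itertools import combinations

def leave_p_out_splits(n, p):
    """Yield (train_indices, validation_indices) for every leave-p-out split."""
    indices = range(n)
    for validation in combinations(indices, p):
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, list(validation)

n, p = 5, 2
splits = list(leave_p_out_splits(n, p))
print(len(splits), math.comb(n, p))   # number of splits equals the binomial coefficient

# Leave-one-out needs only n rounds.
print(len(list(leave_p_out_splits(n, 1))))

# For larger samples the count quickly becomes impractical.
print(math.comb(100, 30))
```

Because the generator yields one split at a time, small cases can be enumerated lazily, but this does not help once the binomial coefficient itself is astronomically large.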


39.2.2 Non-exhaustive cross-validation 

Non-exhaustive cross validation methods do not compute 
all ways of splitting the original sample. Those methods 
are approximations of leave-p-out cross-validation. 


k-fold cross-validation

In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k - 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged (or otherwise combined) to produce a single estimation. The advantage of this method over repeated random sub-sampling (see below) is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used,[7] but in general k remains an unfixed parameter.

When k = n (the number of observations), the k-fold cross-validation is exactly the leave-one-out cross-validation.

In stratified k-fold cross-validation, the folds are selected so that the mean response value is approximately equal in all the folds. In the case of a dichotomous classification, this means that each fold contains roughly the same proportions of the two types of class labels.
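The plain (unstratified) k-fold procedure described above can be sketched as follows. The helper names, the mean-only "model" and the toy data are hypothetical; when n is not divisible by k, the folds in this sketch differ in size by at most one.

```python
import random

def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, validation_indices) for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Fold j takes every k-th shuffled index, so each observation
    # lands in exactly one fold.
    folds = [indices[j::k] for j in range(k)]
    for j in range(k):
        validation = folds[j]
        train = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield train, validation

def cross_validate(xs, ys, k, fit, error):
    """Average the validation error of `fit` over the k folds."""
    scores = []
    for train, validation in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(error(model,
                            [xs[i] for i in validation],
                            [ys[i] for i in validation]))
    return sum(scores) / len(scores)

# Hypothetical model: always predict the mean of the training responses.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mse(model, xs, ys):
    return sum((y - model) ** 2 for y in ys) / len(ys)

xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]

folds = list(k_fold_splits(len(xs), 5))
cv_mse = cross_validate(xs, ys, k=5, fit=fit_mean, error=mse)
print(cv_mse)
```

Setting k equal to the number of observations makes every validation fold a single observation, which reproduces leave-one-out cross-validation.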

2-fold cross-validation

This is the simplest variation of k-fold cross-validation, also called the holdout method.[8] For each fold, we randomly assign data points to two sets $d_0$ and $d_1$, so that both sets are equal size (this is usually implemented by shuffling the data array and then splitting it in two). We then train on $d_0$ and test on $d_1$, followed by training on $d_1$ and testing on $d_0$.

This has the advantage that our training and test sets are both large, and each data point is used for both training and validation on each fold.

Repeated random sub-sampling validation 

This method randomly splits the dataset into training and validation data. For each such split, the model is fit to the training data, and predictive accuracy is assessed using the validation data. The results are then averaged over the splits. The advantage of this method (over k-fold cross-validation) is that the proportion of the training/validation split is not dependent on the number of iterations (folds). The disadvantage of this method is that some observations may never be selected in the validation subsample, whereas others may be selected more than once. In other words, validation subsets may overlap. This method also exhibits Monte Carlo variation, meaning that the results will vary if the analysis is repeated with different random splits.

When the number of random splits goes to infinity, repeated random sub-sampling validation becomes arbitrarily close to leave-p-out cross-validation.

In a stratified variant of this approach, the random samples are generated in such a way that the mean response value (i.e. the dependent variable in the regression) is equal in the training and testing sets. This is particularly useful if the responses are dichotomous with an unbalanced representation of the two response values in the data.
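A minimal sketch of the unstratified procedure is given below, together with a count of how often each observation ends up in a validation subset. The function name, the sample size and the number of splits are hypothetical choices for illustration.

```python
import random

def random_subsampling_splits(n, n_validation, n_splits, seed=0):
    """Yield (train_indices, validation_indices) for repeated random splits."""
    rng = random.Random(seed)
    for _ in range(n_splits):
        # Each split is drawn independently, so validation subsets may overlap.
        validation = sorted(rng.sample(range(n), n_validation))
        held_out = set(validation)
        train = [i for i in range(n) if i not in held_out]
        yield train, validation

n, n_validation, n_splits = 20, 5, 4
all_splits = list(random_subsampling_splits(n, n_validation, n_splits))

# Count how many times each observation is used for validation.
times_validated = [0] * n
for train, validation in all_splits:
    for i in validation:
        times_validated[i] += 1

never_validated = [i for i, c in enumerate(times_validated) if c == 0]
validated_repeatedly = [i for i, c in enumerate(times_validated) if c > 1]
print(never_validated, validated_repeatedly)
```

Changing `seed` changes the splits, which is the Monte Carlo variation mentioned above; the size of the validation subset is chosen independently of the number of splits.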


39.3 Measures of fit 

The goal of cross-validation is to estimate the expected level of fit of a model to a data set that is independent of the data that were used to train the model. It can be used to estimate any quantitative measure of fit that is appropriate for the data and model. For example, for binary classification problems, each case in the validation set is either predicted correctly or incorrectly. In this situation the misclassification error rate can be used to summarize the fit, although other measures like positive predictive value could also be used. When the value being predicted is continuously distributed, the mean squared error, root mean squared error or median absolute deviation could be used to summarize the errors.
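Several of these measures are simple enough to write out directly. The sketch below uses hypothetical function names and toy predictions; the positive predictive value is assumed to be computed only when at least one case is predicted positive.

```python
import math

def misclassification_rate(y_true, y_pred):
    """Fraction of cases whose predicted label differs from the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

def positive_predictive_value(y_true, y_pred, positive=1):
    """Fraction of predicted positives that are truly positive."""
    predicted_positive = [t for t, p in zip(y_true, y_pred) if p == positive]
    # Raises ZeroDivisionError if nothing was predicted positive.
    return sum(1 for t in predicted_positive if t == positive) / len(predicted_positive)

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    return math.sqrt(mean_squared_error(y_true, y_pred))

# Hypothetical binary labels and predictions on a validation set.
labels = [1, 0, 1, 1, 0]
predicted_labels = [1, 1, 1, 0, 0]
error_rate = misclassification_rate(labels, predicted_labels)
ppv = positive_predictive_value(labels, predicted_labels)

# Hypothetical continuous responses and predictions.
values = [1.0, 2.0, 3.0]
predicted_values = [1.5, 2.0, 2.0]
mse_value = mean_squared_error(values, predicted_values)
rmse_value = root_mean_squared_error(values, predicted_values)
print(error_rate, ppv, mse_value, rmse_value)
```

In a cross-validation run, one of these functions would be evaluated on each validation set and the results averaged over the rounds.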


39.4 Applications 

Cross-validation can be used to compare the performances of different predictive modeling procedures. For example, suppose we are interested in optical character recognition, and we are considering using either support vector machines (SVM) or k nearest neighbors (KNN) to predict the true character from an image of a handwritten character. Using cross-validation, we could objectively compare these two methods in terms of their respective fractions of misclassified characters. If we simply compared the methods based on their in-sample error rates, the KNN method would likely appear to perform better, since it is more flexible and hence more prone to overfitting compared to the SVM method.

Cross-validation can also be used in variable selection.[9] Suppose we are using the expression levels of 20 proteins to predict whether a cancer patient will respond to a drug. A practical goal would be to determine which subset of the 20 features should be used to produce the best predictive model. For most modeling procedures, if we compare feature subsets using the in-sample error rates, the best performance will occur when all 20 features are used. However, under cross-validation, the model with the best fit will generally include only a subset of the features that are deemed truly informative.


39.5 Statistical properties 

Suppose we choose a measure of fit F, and use cross-validation to produce an estimate F* of the expected fit EF of a model to an independent data set drawn from the same population as the training data. If we imagine sampling multiple independent training sets following the same distribution, the resulting values for F* will vary. The statistical properties of F* result from this variation.

The cross-validation estimator F* is very nearly unbiased for EF. The reason that it is slightly biased is that the training set in cross-validation is slightly smaller than the actual data set (e.g. for LOOCV the training set size is n - 1 when there are n observed cases). In nearly all situations, the effect of this bias will be conservative in that the estimated fit will be slightly biased in the direction suggesting a poorer fit. In practice, this bias is rarely a concern.

The variance of F* can be large.[10][11] For this reason, if two statistical procedures are compared based on the results of cross-validation, it is important to note that the procedure with the better estimated performance may not actually be the better of the two procedures (i.e. it may not have the better value of EF). Some progress has been made on constructing confidence intervals around cross-validation estimates,[10] but this is considered a difficult problem.

39.6 Computational issues 

Most forms of cross-validation are straightforward to implement as long as an implementation of the prediction method being studied is available. In particular, the prediction method need only be available as a “black box” - there is no need to have access to the internals of its implementation. If the prediction method is expensive to train, cross-validation can be very slow since the training must be carried out repeatedly. In some cases such as least squares and kernel regression, cross-validation can be sped up significantly by pre-computing certain values that are needed repeatedly in the training, or by using fast “updating rules” such as the Sherman-Morrison formula. However, one must be careful to preserve the “total blinding” of the validation set from the training procedure, otherwise bias may result. An extreme example of accelerating cross-validation occurs in linear regression, where the results of cross-validation have a closed-form expression known as the prediction residual error sum of squares (PRESS).
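For ordinary least squares the closed form can be illustrated with simple (one-covariate) linear regression. The sketch below assumes the standard leave-one-out identity, in which each leave-one-out residual equals the ordinary residual divided by one minus the leverage of that observation, and compares it with a brute-force refit. Function names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - b * mean_x, b

def press_closed_form(xs, ys):
    """PRESS from a single fit, using residual / (1 - leverage)."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    total = 0.0
    for x, y in zip(xs, ys):
        leverage = 1.0 / n + (x - mean_x) ** 2 / sxx
        residual = y - (a + b * x)
        total += (residual / (1.0 - leverage)) ** 2
    return total

def press_brute_force(xs, ys):
    """PRESS by refitting the line n times, leaving one point out each time."""
    total = 0.0
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (a + b * xs[i])) ** 2
    return total

# Hypothetical noisy observations of a roughly linear relationship.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9]

press_fast = press_closed_form(xs, ys)
press_slow = press_brute_force(xs, ys)
print(press_fast, press_slow)
```

The closed form needs one fit instead of n, which is the kind of saving the text refers to; the validation point is still excluded from the prediction it is scored against, so the blinding is preserved.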

39.7 Relationship to other forms of validation

In “true validation,” or “holdout validation,” a subset of observations is chosen randomly from the initial sample to form a validation or testing set, and the remaining observations are retained as the training data. Normally, less than a third of the initial sample is used for validation data.[12]


39.8 Limitations and misuse 

Cross-validation only yields meaningful results if the validation set and training set are drawn from the same population and only if human biases are controlled.

In many applications of predictive modeling, the structure of the system being studied evolves over time. Both a changing system and uncontrolled human bias can introduce systematic differences between the training and validation sets. For example, if a model for predicting stock values is trained on data for a certain five-year period, it is unrealistic to treat the subsequent five-year period as a draw from the same population. As another example, suppose a model is developed to predict an individual’s risk for being diagnosed with a particular disease within the next year. If the model is trained using data from a study involving only a specific population group (e.g. young people or males), but is then applied to the general population, the cross-validation results from the training set could differ greatly from the actual predictive performance.

In many applications, models also may be incorrectly specified and vary as a function of modeler biases and/or arbitrary choices. When this occurs, there may be an illusion that the system changes in external samples, whereas the reason is that the model has missed a critical predictor and/or included a confounded predictor. New evidence is that cross-validation by itself is not very predictive of external validity, whereas a form of experimental validation known as swap sampling that does control for human bias can be much more predictive of external validity.[13] As defined by this large MAQC-II study across 30,000 models, swap sampling incorporates cross-validation in the sense that predictions are tested across independent training and validation samples. Yet, models are also developed across these independent samples and by modelers who are blinded to one another. When there is a mismatch in these models developed across these swapped training and validation samples, as happens quite frequently, MAQC-II shows that this will be much more predictive of poor external predictive validity than traditional cross-validation.

The reason for the success of the swapped sampling is a built-in control for human biases in model building. In addition to placing too much faith in predictions that may vary across modelers and lead to poor external validity due to these confounding modeler effects, these are some other ways that cross-validation can be misused:

• By performing an initial analysis to identify the most informative features using the entire data set - if feature selection or model tuning is required by the modeling procedure, this must be repeated on every training set. Otherwise, predictions will certainly be upwardly biased.[14] If cross-validation is used to decide which features to use, an inner cross-validation to carry out the feature selection on every training set must be performed.[13]

• By allowing some of the training data to also be included in the test set - this can happen due to “twinning” in the data set, whereby some exactly identical or nearly identical samples are present in the data set. Note that to some extent twinning always takes place even in perfectly independent training and validation samples. This is because some of the training sample observations will have nearly identical values of predictors as validation sample observations. And some of these will correlate with a target at better than chance levels in the same direction in both training and validation when they are actually driven by confounded predictors with poor external validity. If such a cross-validated model is selected from a k-fold set, human confirmation bias will be at work and determine that such a model has been validated. This is why traditional cross-validation needs to be supplemented with controls for human bias and confounded model specification like swap sampling and prospective studies.

It should be noted that some statisticians have questioned the usefulness of validation samples.[16]


39.9 See also 

• Boosting (machine learning) 

• Bootstrap aggregating (bagging) 

• Bootstrapping (statistics) 

• Resampling (statistics)


Chapter 40

Unsupervised learning 


In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. This distinguishes unsupervised learning from supervised learning and reinforcement learning.

Unsupervised learning is closely related to the problem of density estimation in statistics.[1] However, unsupervised learning also encompasses many other techniques that seek to summarize and explain key features of the data. Many methods employed in unsupervised learning are based on data mining methods used to preprocess data.

Approaches to unsupervised learning include:

• clustering (e.g., k-means, mixture models, hierarchical clustering),[2]

• Approaches for learning latent variable models, such as

    • Expectation-maximization algorithm (EM)

    • Method of moments

• Blind signal separation techniques, e.g.,

    • Principal component analysis,

    • Independent component analysis,

    • Non-negative matrix factorization,

    • Singular value decomposition.[3]

Among neural network models, the self-organizing map (SOM) and adaptive resonance theory (ART) are commonly used unsupervised learning algorithms. The SOM is a topographic organization in which nearby locations in the map represent inputs with similar properties. The ART model allows the number of clusters to vary with problem size and lets the user control the degree of similarity between members of the same clusters by means of a user-defined constant called the vigilance parameter. ART networks are also used for many pattern recognition tasks, such as automatic target recognition and seismic signal processing. The first version of ART was “ART1”, developed by Carpenter and Grossberg (1988).[4]


40.1 Method of moments 

One of the approaches in unsupervised learning is the
method of moments. In the method of moments, the unknown
parameters of interest in the model are related to the
moments of one or more random variables, so these
parameters can be estimated once the moments are known.
The moments are usually estimated empirically from
samples. The basic moments are the first and second order
moments. For a random vector, the first order moment is
the mean vector, and the second order moment is the
covariance matrix (when the mean is zero). Higher order
moments are usually represented using tensors, which
generalize matrices to higher orders as multi-dimensional
arrays.
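
As a small illustration of relating parameters to moments, the
sketch below estimates the shape and scale of a gamma distribution
from the sample mean and variance. The choice of the gamma
distribution, the function name gamma_method_of_moments and the
sample size are assumptions made for this example only; they do not
come from the text above.

```python
import random
import statistics


def gamma_method_of_moments(samples):
    """Estimate (shape, scale) of a gamma distribution from sample moments.

    For a gamma distribution, mean = shape * scale and
    variance = shape * scale**2, so both parameters can be solved for
    from the first two empirical moments.
    """
    mean = statistics.fmean(samples)            # first-order moment
    var = statistics.pvariance(samples, mean)   # second-order central moment
    shape = mean * mean / var
    scale = var / mean
    return shape, scale


# Hypothetical usage: draw samples with known parameters and recover them.
rng = random.Random(0)
data = [rng.gammavariate(3.0, 2.0) for _ in range(20000)]
shape_hat, scale_hat = gamma_method_of_moments(data)
```

With enough samples the estimates should land close to the true
values (3.0 and 2.0 here), although the exact numbers depend on the
random draw.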

In particular, the method of moments has been shown to be
effective in learning the parameters of latent variable
models.[5] Latent variable models are statistical models
in which, in addition to the observed variables, there is
a set of latent variables that is not observed. A highly
practical example of latent variable models in machine
learning is topic modeling, a statistical model for
generating the words (observed variables) in a document
based on the topic (latent variable) of the document. In
topic modeling, the words in the document are generated
according to different statistical parameters when the
topic of the document changes. It has been shown that the
method of moments (tensor decomposition techniques)
consistently recovers the parameters of a large class of
latent variable models under some assumptions.[5]

The expectation-maximization algorithm (EM) is also one
of the most practical methods for learning latent variable
models. However, it can get stuck in local optima, and
global convergence of the algorithm to the true unknown
parameters of the model is not guaranteed. For the method
of moments, in contrast, global convergence is guaranteed
under some conditions.[5]


40.2 See also 

• Cluster analysis 

• Expectation-maximization algorithm

• Generative topographic map

• Multilinear subspace learning 

• Multivariate analysis 

• Radial basis function network


Chapter 41

Cluster analysis 


For the supervised learning approach, see Statistical
classification.

Cluster analysis or clustering is the task of grouping
a set of objects in such a way that objects in the same
group (called a cluster) are more similar (in some sense
or another) to each other than to those in other groups
(clusters). It is a main task of exploratory data
mining, and a common technique for statistical data
analysis, used in many fields, including machine
learning, pattern recognition, image analysis,
information retrieval, and bioinformatics.

The result of a cluster analysis shown as the coloring of the
squares into three clusters.

Cluster analysis itself is not one specific algorithm, but
the general task to be solved. It can be achieved by
various algorithms that differ significantly in their
notion of what constitutes a cluster and how to
efficiently find them. Popular notions of clusters include
groups with small distances among the cluster members,
dense areas of the data space, intervals or particular
statistical distributions. Clustering can therefore be
formulated as a multi-objective optimization problem. The
appropriate clustering algorithm and parameter settings
(including values such as the distance function to use, a
density threshold or the number of expected clusters)
depend on the individual data set and intended use of the
results. Cluster analysis as such is not an automatic
task, but an iterative process of knowledge discovery or
interactive multi-objective optimization that involves
trial and failure. It will often be necessary to modify
data preprocessing and model parameters until the result
achieves the desired properties.

Besides the term clustering, there are a number of terms
with similar meanings, including automatic classification,
numerical taxonomy, botryology (from Greek βότρυς
“grape”) and typological analysis. The subtle differences
are often in the usage of the results: while in data
mining, the resulting groups are the matter of interest,
in automatic classification the resulting discriminative
power is of interest. This often leads to
misunderstandings between researchers coming from the
fields of data mining and machine learning, since they use
the same terms and often the same algorithms, but have
different goals.

Cluster analysis was originated in anthropology by Driver
and Kroeber in 1932 and introduced to psychology by Zubin
in 1938 and Robert Tryon in 1939,[1][2] and famously used
by Cattell beginning in 1943[3] for trait theory
classification in personality psychology.


41.1 Definition 

According to Vladimir Estivill-Castro, the notion of a
“cluster” cannot be precisely defined, which is one of the
reasons why there are so many clustering algorithms.[4]
There is a common denominator: a group of data objects.
However, different researchers employ different cluster
models, and for each of these cluster models again
different algorithms can be given. The notion of a
cluster, as found by different algorithms, varies
significantly in its properties. Understanding these
“cluster models” is key to understanding the differences
between the various algorithms. Typical cluster models
include:

• Connectivity models: for example, hierarchical
clustering builds models based on distance connectivity.

• Centroid models: for example, the k-means algorithm
represents each cluster by a single mean vector.

• Distribution models: clusters are modeled using
statistical distributions, such as multivariate normal
distributions used by the Expectation-maximization
algorithm.

• Density models: for example, DBSCAN and OPTICS
define clusters as connected dense regions in the data
space.

• Subspace models: in biclustering (also known as
co-clustering or two-mode-clustering), clusters are
modeled with both cluster members and relevant
attributes.

• Group models: some algorithms do not provide a
refined model for their results and just provide the
grouping information.

• Graph-based models: a clique, i.e., a subset of nodes
in a graph such that every two nodes in the subset are
connected by an edge, can be considered as a
prototypical form of cluster. Relaxations of the complete
connectivity requirement (a fraction of the edges can
be missing) are known as quasi-cliques, as in the HCS
clustering algorithm.

A “clustering” is essentially a set of such clusters, usually 
containing all objects in the data set. Additionally, it may 
specify the relationship of the clusters to each other, for 
example a hierarchy of clusters embedded in each other. 
Clusterings can be roughly distinguished as: 

• hard clustering: each object belongs to a cluster or 
not 

• soft clustering (also: fuzzy clustering): each object 
belongs to each cluster to a certain degree (e.g. a 
likelihood of belonging to the cluster) 

There are also finer distinctions possible, for example: 

• strict partitioning clustering: here each object be¬ 
longs to exactly one cluster 

• strict partitioning clustering with outliers: objects 
can also belong to no cluster, and are considered 
outliers. 

• overlapping clustering (also: alternative clustering, 
multi-view clustering): while usually a hard cluster¬ 
ing, objects may belong to more than one cluster. 

• hierarchical clustering: objects that belong to a child 
cluster also belong to the parent cluster 

• subspace clustering: while an overlapping cluster¬ 
ing, within a uniquely defined subspace, clusters are 
not expected to overlap. 

41.2 Algorithms 

Main category: Data clustering algorithms 


Clustering algorithms can be categorized based on their 
cluster model, as listed above. The following overview 
will only list the most prominent examples of clustering 
algorithms, as there are possibly over 100 published clus¬ 
tering algorithms. Not all provide models for their clus¬ 
ters and can thus not easily be categorized. An overview 
of algorithms explained in Wikipedia can be found in the 
list of statistics algorithms. 

There is no objectively “correct” clustering algorithm,
but as it was noted, “clustering is in the eye of the
beholder.”[4] The most appropriate clustering algorithm
for a particular problem often needs to be chosen
experimentally, unless there is a mathematical reason to
prefer one cluster model over another. It should be noted
that an algorithm that is designed for one kind of model
has no chance on a data set that contains a radically
different kind of model.[4] For example, k-means cannot
find non-convex clusters.[4]

41.2.1 Connectivity based clustering (hierarchical clustering)

Main article: Hierarchical clustering 

Connectivity based clustering, also known as hierarchical 
clustering, is based on the core idea of objects being 
more related to nearby objects than to objects farther 
away. These algorithms connect “objects” to form “clus¬ 
ters” based on their distance. A cluster can be described 
largely by the maximum distance needed to connect parts 
of the cluster. At different distances, different clusters 
will form, which can be represented using a dendrogram, 
which explains where the common name “hierarchical 
clustering” comes from: these algorithms do not provide 
a single partitioning of the data set, but instead provide 
an extensive hierarchy of clusters that merge with each 
other at certain distances. In a dendrogram, the y-axis 
marks the distance at which the clusters merge, while the 
objects are placed along the x-axis such that the clusters 
don't mix. 

Connectivity based clustering is a whole family of meth¬ 
ods that differ by the way distances are computed. Apart 
from the usual choice of distance functions, the user also 
needs to decide on the linkage criterion (since a clus¬ 
ter consists of multiple objects, there are multiple candi¬ 
dates to compute the distance to) to use. Popular choices 
are known as single-linkage clustering (the minimum of 
object distances), complete linkage clustering (the maxi¬ 
mum of object distances) or UPGMA (“Unweighted Pair 
Group Method with Arithmetic Mean”, also known as av¬ 
erage linkage clustering). Furthermore, hierarchical clus¬ 
tering can be agglomerative (starting with single elements 
and aggregating them into clusters) or divisive (starting 
with the complete data set and dividing it into partitions). 
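
The linkage criteria named above can be sketched in a few lines.
The example below is a naive agglomerative procedure written for
readability rather than speed; the function names, the use of
Euclidean distance and the stopping rule (a target number of
clusters) are assumptions made for illustration, not part of the
methods' definitions.

```python
import math
from itertools import product


def pairwise_distances(cluster_a, cluster_b):
    """All Euclidean distances between members of two clusters."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    """Minimum of the object distances."""
    return min(pairwise_distances(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    """Maximum of the object distances."""
    return max(pairwise_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    """Arithmetic mean of the object distances (UPGMA)."""
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


def agglomerate(points, linkage, num_clusters):
    """Start with single elements and merge the closest pair of clusters
    until only num_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        index_pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(index_pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Swapping the linkage argument changes which clusters are merged
first, which is the sense in which the methods form a family.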

These methods will not produce a unique partitioning of
the data set, but a hierarchy from which the user still
needs to choose appropriate clusters. They are not very
robust towards outliers, which will either show up as
additional clusters or even cause other clusters to merge
(known as “chaining phenomenon”, in particular with
single-linkage clustering). In the general case, the
complexity is $O(n^3)$, which makes them too slow for
large data sets. For some special cases, optimal
efficient methods (of complexity $O(n^2)$) are known:
SLINK[5] for single-linkage and CLINK[6] for
complete-linkage clustering. In the data mining community
these methods are recognized as a theoretical foundation
of cluster analysis, but often considered obsolete. They
did however provide inspiration for many later methods
such as density based clustering.

• Linkage clustering examples 

• Single-linkage on Gaussian data. At 35 clusters, the 
biggest cluster starts fragmenting into smaller parts, 
while before it was still connected to the second 
largest due to the single-link effect. 

• Single-linkage on density-based clusters. 20 clusters 
extracted, most of which contain single elements, 
since linkage clustering does not have a notion of 
“noise”. 


41.2.2 Centroid-based clustering 

Main article: k-means clustering 

In centroid-based clustering, clusters are represented by
a central vector, which may not necessarily be a member
of the data set. When the number of clusters is fixed to
k, k-means clustering gives a formal definition as an
optimization problem: find the k cluster centers and
assign the objects to the nearest cluster center, such
that the squared distances from the cluster are minimized.

The optimization problem itself is known to be NP-hard,
and thus the common approach is to search only for
approximate solutions. A particularly well known
approximative method is Lloyd’s algorithm,[7] often
actually referred to as "k-means algorithm". It does
however only find a local optimum, and is commonly run
multiple times with different random initializations.
Variations of k-means often include such optimizations as
choosing the best of multiple runs, but also restricting
the centroids to members of the data set (k-medoids),
choosing medians (k-medians clustering), choosing the
initial centers less randomly (k-means++) or allowing a
fuzzy cluster assignment (Fuzzy c-means).
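
A minimal sketch of Lloyd's algorithm follows, assuming Euclidean
distance, points given as tuples, and initial centers drawn at
random from the data set. The function name lloyd_kmeans and its
parameters (iterations, seed) are hypothetical and chosen for this
illustration.

```python
import math
import random


def lloyd_kmeans(points, k, iterations=100, seed=0):
    """Alternate between assigning points to the nearest center and
    recomputing each center as the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initialization from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centers[c]
            for c, members in enumerate(clusters)
        ]
        if new_centers == centers:  # converged to a local optimum
            break
        centers = new_centers
    return centers, clusters
```

Because the result is only a local optimum, a caller would
typically invoke this with several different seed values and keep
the run with the smallest sum of squared distances.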

Most k-means-type algorithms require the number of
clusters - k - to be specified in advance, which is
considered to be one of the biggest drawbacks of these
algorithms. Furthermore, the algorithms prefer clusters of
approximately similar size, as they will always assign an
object to the nearest centroid. This often leads to
incorrectly cut borders in between clusters (which is not
surprising, as the algorithm optimizes cluster centers,
not cluster borders).

K-means has a number of interesting theoretical
properties. First, it partitions the data space into a
structure known as a Voronoi diagram. Second, it is
conceptually close to nearest neighbor classification,
and as such is popular in machine learning. Third, it can
be seen as a variation of model based classification, and
Lloyd’s algorithm as a variation of the
Expectation-maximization algorithm for this model
discussed below.

• k-Means clustering examples 

• K-means separates data into Voronoi-cells, which 
assumes equal-sized clusters (not adequate here) 

• K-means cannot represent density-based clusters 

41.2.3 Distribution-based clustering 

The clustering model most closely related to statistics is 
based on distribution models. Clusters can then easily be 
defined as objects belonging most likely to the same dis¬ 
tribution. A convenient property of this approach is that 
this closely resembles the way artificial data sets are gen¬ 
erated: by sampling random objects from a distribution. 

While the theoretical foundation of these methods is 
excellent, they suffer from one key problem known as 
overfitting, unless constraints are put on the model com¬ 
plexity. A more complex model will usually be able to 
explain the data better, which makes choosing the appro¬ 
priate model complexity inherently difficult. 

One prominent method is known as Gaussian mixture 
models (using the expectation-maximization algorithm). 
Here, the data set is usually modelled with a fixed (to 
avoid overfitting) number of Gaussian distributions that 
are initialized randomly and whose parameters are iter¬ 
atively optimized to fit better to the data set. This will 
converge to a local optimum, so multiple runs may pro¬ 
duce different results. In order to obtain a hard clustering, 
objects are often then assigned to the Gaussian distribu¬ 
tion they most likely belong to; for soft clusterings, this is 
not necessary. 
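
The iterative fitting described above can be sketched for the
one-dimensional case. This is an illustrative sketch only: the
function name em_gmm_1d, the fixed iteration count, the small
variance floor and the explicitly supplied starting parameters
(instead of random initialization) are all assumptions made to keep
the example short and reproducible.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, iterations=50):
    """Fit a fixed number of 1-D Gaussians by expectation-maximization."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: degree to which each object belongs to each Gaussian.
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each Gaussian from its weighted objects.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor guards against collapse
    return means, variances, weights, resp


# Hypothetical usage on synthetic data drawn from two Gaussians.
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(6.0, 1.0) for _ in range(200)]
means, variances, weights, resp = em_gmm_1d(data, [1.0, 5.0], [1.0, 1.0], [0.5, 0.5])

# Hard clustering: assign each object to its most likely Gaussian.
hard_labels = [max(range(len(r)), key=r.__getitem__) for r in resp]
```

The rows of resp are the soft assignments; hard_labels is the hard
clustering derived from them.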

Distribution-based clustering produces complex models 
for clusters that can capture correlation and dependence 
between attributes. However, these algorithms put an ex¬ 
tra burden on the user: for many real data sets, there may 
be no concisely defined mathematical model (e.g. assum¬ 
ing Gaussian distributions is a rather strong assumption 
on the data). 

• Expectation-Maximization (EM) clustering examples

• On Gaussian-distributed data, EM works well, since
it uses Gaussians for modelling clusters

• Density-based clusters cannot be modeled using 
Gaussian distributions 


41.2.4 Density-based clustering 

In density-based clustering,[8] clusters are defined as
areas of higher density than the remainder of the data
set. Objects in these sparse areas - that are required to
separate clusters - are usually considered to be noise and
border points.

The most popular[9] density based clustering method is
DBSCAN.[10] In contrast to many newer methods, it
features a well-defined cluster model called
“density-reachability”. Similar to linkage based
clustering, it is based on connecting points within
certain distance thresholds. However, it only connects
points that satisfy a density criterion, in the original
variant defined as a minimum number of other objects
within this radius. A cluster consists of all
density-connected objects (which can form a cluster of an
arbitrary shape, in contrast to many other methods) plus
all objects that are within these objects’ range. Another
interesting property of DBSCAN is that its complexity is
fairly low - it requires a linear number of range queries
on the database - and that it will discover essentially
the same results (it is deterministic for core and noise
points, but not for border points) in each run, therefore
there is no need to run it multiple times. OPTICS[11] is
a generalization of DBSCAN that removes the need to choose
an appropriate value for the range parameter ε, and
produces a hierarchical result related to that of linkage
clustering. DeLi-Clu[12] (Density-Link-Clustering)
combines ideas from single-linkage clustering and OPTICS,
eliminating the ε parameter entirely and offering
performance improvements over OPTICS by using an R-tree
index.
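
The DBSCAN procedure described above can be sketched as follows.
This is a simplified illustration: the range query is a plain
linear scan rather than an index lookup, and the neighbor count
here includes the point itself, which is an assumption of this
sketch. The names dbscan, region_query, eps and min_pts are
chosen for the example.

```python
import math

NOISE = -1


def region_query(points, idx, eps):
    """Indices of all points within distance eps of points[idx] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels
```

Each point triggers at most one range query, which matches the
linear number of range queries mentioned above. A border point
reachable from two clusters is given to whichever cluster reaches
it first, which is why results for border points can depend on the
processing order.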

The key drawback of DBSCAN and OPTICS is that they
expect some kind of density drop to detect cluster
borders. Moreover, they cannot detect intrinsic cluster
structures which are prevalent in the majority of real
life data. A variation of DBSCAN, EnDBSCAN,[13]
efficiently detects such kinds of structures. On data
sets with, for example, overlapping Gaussian
distributions - a common use case in artificial data - the
cluster borders produced by these algorithms will often
look arbitrary, because the cluster density decreases
continuously. On a data set consisting of mixtures of
Gaussians, these algorithms are nearly always outperformed
by methods such as EM clustering that are able to
precisely model this kind of data.

Mean-shift is a clustering approach where each object is
moved to the densest area in its vicinity, based on kernel
density estimation. Eventually, objects converge to local
maxima of density. Similar to k-means clustering, these
“density attractors” can serve as representatives for the
data set, but mean-shift can detect arbitrary-shaped
clusters similar to DBSCAN. Due to the expensive iterative
procedure and density estimation, mean-shift is usually
slower than DBSCAN or k-means.
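
A rough sketch of the mean-shift iteration is given below. It
assumes a flat (uniform) kernel, so that each object is simply
moved to the mean of the data points within a fixed radius; the
function name mean_shift and the parameters bandwidth and
iterations are hypothetical names for this illustration.

```python
import math


def mean_shift(points, bandwidth, iterations=50):
    """Move a copy of every point towards the mean of the original data
    points within `bandwidth` of it, until nothing moves any more."""
    shifted = [tuple(p) for p in points]
    for _ in range(iterations):
        moved = []
        for p in shifted:
            window = [q for q in points if math.dist(p, q) <= bandwidth]
            if window:
                moved.append(tuple(sum(c) / len(window) for c in zip(*window)))
            else:
                moved.append(p)  # no data nearby: leave the point in place
        if moved == shifted:
            break
        shifted = moved
    return shifted
```

Objects that end at the same position share a density attractor
and can be treated as one cluster. The repeated neighborhood scans
are what make this approach comparatively slow.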

• Density-based clustering examples 

• Density-based clustering with DBSCAN. 

• DBSCAN assumes clusters of similar density, and 
may have problems separating nearby clusters 

• OPTICS is a DBSCAN variant that handles different 
densities much better 

41.2.5 Recent developments 

In recent years considerable effort has been put into
improving the performance of existing algorithms.[14][15]
Among them are CLARANS (Ng and Han, 1994),[16] and
BIRCH (Zhang et al., 1996).[17] With the recent need to
process larger and larger data sets (also known as big
data), the willingness to trade semantic meaning of the
generated clusters for performance has been increasing.
This led to the development of pre-clustering methods
such as canopy clustering, which can process huge data
sets efficiently, but the resulting “clusters” are merely a
rough pre-partitioning of the data set to then analyze the
partitions with existing slower methods such as k-means
clustering. Various other approaches to clustering have
been tried such as seed based clustering.[18]

For high-dimensional data, many of the existing methods
fail due to the curse of dimensionality, which renders
particular distance functions problematic in
high-dimensional spaces. This led to new clustering
algorithms for high-dimensional data that focus on
subspace clustering (where only some attributes are used,
and cluster models include the relevant attributes for the
cluster) and correlation clustering that also looks for
arbitrary rotated (“correlated”) subspace clusters that
can be modeled by giving a correlation of their
attributes. Examples for such clustering algorithms are
CLIQUE[19] and SUBCLU.[20]

Ideas from density-based clustering methods (in
particular the DBSCAN/OPTICS family of algorithms) have
been adopted to subspace clustering (HiSC,[21]
hierarchical subspace clustering and DiSH[22]) and
correlation clustering (HiCO,[23] hierarchical correlation
clustering, 4C[24] using “correlation connectivity” and
ERiC[25] exploring hierarchical density-based correlation
clusters).

Several different clustering systems based on mutual
information have been proposed. One is Marina Meila's
variation of information metric;[26] another provides
hierarchical clustering.[27] Using genetic algorithms, a
wide range of different fit-functions can be optimized,
including mutual information.[28] Also, message passing
algorithms, a recent development in computer science and
statistical physics, have led to the creation of new types
of clustering algorithms.[29]




41.2.6 Other methods 

• Basic sequential algorithmic scheme (BSAS) 


41.3 Evaluation and assessment 

Evaluation of clustering results sometimes is referred to 
as cluster validation. 

There have been several suggestions for a measure of sim¬ 
ilarity between two clusterings. Such a measure can be 
used to compare how well different data clustering al¬ 
gorithms perform on a set of data. These measures are 
usually tied to the type of criterion being considered in 
assessing the quality of a clustering method. 


41.3.1 Internal evaluation 

When a clustering result is evaluated based on the data
that was clustered itself, this is called internal
evaluation. These methods usually assign the best score to
the algorithm that produces clusters with high similarity
within a cluster and low similarity between clusters. One
drawback of using internal criteria in cluster evaluation
is that high scores on an internal measure do not
necessarily result in effective information retrieval
applications.[30] Additionally, this evaluation is biased
towards algorithms that use the same cluster model. For
example, k-means clustering naturally optimizes object
distances, and a distance-based internal criterion will
likely overrate the resulting clustering.

Therefore, the internal evaluation measures are best
suited to get some insight into situations where one
algorithm performs better than another, but this does not
imply that one algorithm produces more valid results than
another.[4] Validity as measured by such an index depends
on the claim that this kind of structure exists in the
data set. An algorithm designed for some kind of models
has no chance if the data set contains a radically
different set of models, or if the evaluation measures a
radically different criterion.[4] For example, k-means
clustering can only find convex clusters, and many
evaluation indexes assume convex clusters. On a data set
with non-convex clusters neither the use of k-means, nor
of an evaluation criterion that assumes convexity, is
sound.

The following methods can be used to assess the quality
of clustering algorithms based on internal criteria:

• Davies-Bouldin index 


The Davies-Bouldin index can be calculated by
the following formula:

$$DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right)$$

where $n$ is the number of clusters, $c_x$ is the
centroid of cluster $x$, $\sigma_x$ is the average
distance of all elements in cluster $x$ to centroid
$c_x$, and $d(c_i, c_j)$ is the distance between
centroids $c_i$ and $c_j$. Since algorithms that
produce clusters with low intra-cluster distances
(high intra-cluster similarity) and high inter-cluster
distances (low inter-cluster similarity) will have a
low Davies-Bouldin index, the clustering algorithm that
produces a collection of clusters with the smallest
Davies-Bouldin index is considered the best algorithm
based on this criterion.

• Dunn index 

The Dunn index aims to identify dense and
well-separated clusters. It is defined as the ratio
between the minimal inter-cluster distance and the
maximal intra-cluster distance. For each cluster
partition, the Dunn index can be calculated by the
following formula:[31]

$$D = \frac{\min_{1 \leq i < j \leq n} d(i,j)}{\max_{1 \leq k \leq n} d'(k)}$$

where $d(i,j)$ represents the distance between
clusters $i$ and $j$, and $d'(k)$ measures the
intra-cluster distance of cluster $k$. The inter-cluster
distance $d(i,j)$ between two clusters may be any
number of distance measures, such as the distance
between the centroids of the clusters. Similarly, the
intra-cluster distance $d'(k)$ may be measured in a
variety of ways, such as the maximal distance between
any pair of elements in cluster $k$. Since internal
criteria seek clusters with high intra-cluster
similarity and low inter-cluster similarity, algorithms
that produce clusters with a high Dunn index are more
desirable.

• Silhouette coefficient 

The silhouette coefficient contrasts the average 
distance to elements in the same cluster with 
the average distance to elements in other clus¬ 
ters. Objects with a high silhouette value are 
considered well clustered, objects with a low 
value may be outliers. This index works well 
with k-means clustering, and is also used to de¬ 
termine the optimal number of clusters. 
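
The three internal indices above can be sketched as follows. The
sketch assumes Euclidean distance, takes the centroid distance as
the inter-cluster distance and the maximal pairwise distance as the
intra-cluster distance for the Dunn index, and uses the common
per-object silhouette form (b - a) / max(a, b), which is not
spelled out in the text above. Each function expects at least two
clusters, given as lists of coordinate tuples; the function names
are chosen for this illustration.

```python
import math


def centroid(cluster):
    """Mean vector of a cluster given as a list of coordinate tuples."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def davies_bouldin(clusters):
    """Lower values indicate compact, well-separated clusters."""
    cents = [centroid(c) for c in clusters]
    sigma = [sum(math.dist(p, cents[i]) for p in c) / len(c)
             for i, c in enumerate(clusters)]
    n = len(clusters)
    return sum(
        max((sigma[i] + sigma[j]) / math.dist(cents[i], cents[j])
            for j in range(n) if j != i)
        for i in range(n)
    ) / n


def dunn(clusters):
    """Higher values indicate compact, well-separated clusters."""
    cents = [centroid(c) for c in clusters]
    n = len(clusters)
    min_inter = min(math.dist(cents[i], cents[j])
                    for i in range(n) for j in range(i + 1, n))
    max_intra = max(math.dist(p, q) for c in clusters for p in c for q in c)
    return min_inter / max_intra


def mean_silhouette(clusters):
    """Average silhouette value over all objects (between -1 and 1)."""
    scores = []
    for i, cluster in enumerate(clusters):
        for p in cluster:
            if len(cluster) == 1:
                scores.append(0.0)  # convention for singleton clusters
                continue
            # a: average distance to the other members of the same cluster
            a = sum(math.dist(p, q) for q in cluster) / (len(cluster) - 1)
            # b: smallest average distance to the members of another cluster
            b = min(sum(math.dist(p, q) for q in other) / len(other)
                    for j, other in enumerate(clusters) if j != i)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

On a well-separated partition one would expect a low Davies-Bouldin
value together with a high Dunn index and a silhouette close to 1.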


41.3.2 External evaluation 

In external evaluation, clustering results are evaluated
based on data that was not used for clustering, such
as known class labels and external benchmarks. Such
benchmarks consist of a set of pre-classified items, and
these sets are often created by human experts. Thus, the
benchmark sets can be thought of as a gold standard for
evaluation. These types of evaluation methods measure
how close the clustering is to the predetermined
benchmark classes. However, it has recently been discussed
whether this is adequate for real data, or only on
synthetic data sets with a factual ground truth, since
classes can contain internal structure, the attributes
present may not allow separation of clusters or the
classes may contain anomalies.[32] Additionally, from a
knowledge discovery point of view, the reproduction of
known knowledge may not necessarily be the intended
result.[32]

A number of measures are adapted from variants used 
to evaluate classification tasks. In place of counting the 
number of times a class was correctly assigned to a sin¬ 
gle data point (known as true positives), such pair count¬ 
ing metrics assess whether each pair of data points that 
is truly in the same cluster is predicted to be in the same 
cluster. 

Some of the measures of quality of a cluster algorithm
using external criteria include:

• Rand measure (William M. Rand 1971)[33]

The Rand index computes how similar the clusters
(returned by the clustering algorithm) are to the
benchmark classifications. One can also view the
Rand index as a measure of the percentage of correct
decisions made by the algorithm. It can be computed
using the following formula:

$$RI = \frac{TP + TN}{TP + FP + FN + TN}$$

where $TP$ is the number of true positives, $TN$
is the number of true negatives, $FP$ is the number
of false positives, and $FN$ is the number of false
negatives. One issue with the Rand index is that false
positives and false negatives are equally weighted.
This may be an undesirable characteristic for some
clustering applications.

The F-measure addresses this concern, as does
the chance-corrected adjusted Rand index.

• F-measure

The F-measure can be used to balance the contribution
of false negatives by weighting recall through a
parameter $\beta \geq 0$. Let precision and recall be
defined as follows:

$$P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}$$

where $P$ is the precision rate and $R$ is the recall
rate. We can calculate the F-measure by using the
following formula:[30]

$$F_\beta = \frac{(\beta^2 + 1) \cdot P \cdot R}{\beta^2 \cdot P + R}$$

Notice that when $\beta = 0$, $F_0 = P$. In other
words, recall has no impact on the F-measure when
$\beta = 0$, and increasing $\beta$ allocates an
increasing amount of weight to recall in the final
F-measure.

• Jaccard index

The Jaccard index is used to quantify the similarity
between two datasets. The Jaccard index takes on a
value between 0 and 1. An index of 1 means that the
two datasets are identical, and an index of 0 indicates
that the datasets have no common elements. The Jaccard
index is defined by the following formula:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{TP}{TP + FP + FN}$$

This is simply the number of unique elements
common to both sets divided by the total number
of unique elements in both sets.

• Fowlkes-Mallows index (E. B. Fowlkes & C. L.
Mallows 1983)[34]

The Fowlkes-Mallows index computes the similarity
between the clusters returned by the clustering
algorithm and the benchmark classifications. The
higher the value of the Fowlkes-Mallows index, the
more similar the clusters and the benchmark
classifications are. It can be computed using the
following formula:

$$FM = \sqrt{\frac{TP}{TP + FP} \cdot \frac{TP}{TP + FN}}$$

where $TP$ is the number of true positives, $FP$
is the number of false positives, and $FN$ is the
number of false negatives. The FM index is the
geometric mean of the precision and recall $P$ and
$R$, while the F-measure is their harmonic mean.[35]
Moreover, precision and recall are also known as
Wallace’s indices $B^{I}$ and $B^{II}$.[36]

• The Mutual Information is an information theo¬ 
retic measure of how much information is shared 
between a clustering and a ground-truth classifica¬ 
tion that can detect a non-linear similarity between 
two clusterings. Adjusted mutual information is the 
corrected-for-chance variant of this that has a re¬ 
duced bias for varying cluster numbers. 


• Confusion matrix 


A confusion matrix can be used to quickly vi¬ 
sualize the results of a classification (or cluster¬ 
ing) algorithm. It shows how different a cluster 
is from the gold standard cluster. 
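
The pair-counting measures in the list above (Rand index,
F-measure, Jaccard index and Fowlkes-Mallows index) can be
sketched from two label lists. The function names and the
convention that a "positive" is a pair of objects placed in the
same cluster are assumptions of this illustration; the sketch also
assumes that at least one pair shares a cluster in each labeling,
so that no denominator is zero.

```python
import math
from itertools import combinations


def pair_counts(predicted, truth):
    """Count object pairs by whether they share a cluster in each labeling."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_pred = predicted[i] == predicted[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            tp += 1          # together in both
        elif same_pred:
            fp += 1          # together only in the clustering result
        elif same_true:
            fn += 1          # together only in the benchmark
        else:
            tn += 1          # separated in both
    return tp, fp, fn, tn


def rand_index(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def f_measure(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)


def jaccard_index(tp, fp, fn):
    return tp / (tp + fp + fn)


def fowlkes_mallows(tp, fp, fn):
    return math.sqrt((tp / (tp + fp)) * (tp / (tp + fn)))
```

For example, pair_counts([0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1])
yields the four counts from which all of the measures can then be
computed.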
41.4 Applications

Biology, computational biology and bioinformatics

Plant and animal ecology: cluster analysis is used to
    describe and to make spatial and temporal
    comparisons of communities (assemblages) of
    organisms in heterogeneous environments; it is
    also used in plant systematics to generate
    artificial phylogenies or clusters of organisms
    (individuals) at the species, genus or higher level
    that share a number of attributes.

Transcriptomics: clustering is used to build groups
    of genes with related expression patterns (also
    known as coexpressed genes), as in the HCS
    clustering algorithm. Often such groups contain
    functionally related proteins, such as enzymes for a
    specific pathway, or genes that are co-regulated.
    High throughput experiments using expressed
    sequence tags (ESTs) or DNA microarrays can
    be a powerful tool for genome annotation, a
    general aspect of genomics.

Sequence analysis: clustering is used to group
    homologous sequences into gene families. This is a
    very important concept in bioinformatics, and
    evolutionary biology in general. See evolution
    by gene duplication.

High-throughput genotyping platforms: clustering
    algorithms are used to automatically assign
    genotypes.

Human genetic clustering: the similarity of genetic
    data is used in clustering to infer population
    structures.

Medicine

Medical imaging: on PET scans, cluster analysis can be
    used to differentiate between different types of
    tissue and blood in a three-dimensional image. In
    this application, actual position does not matter,
    but the voxel intensity is considered as a vector,
    with a dimension for each image that was taken
    over time. This technique allows, for example,
    accurate measurement of the rate a radioactive
    tracer is delivered to the area of interest,
    without a separate sampling of arterial blood,
    an intrusive technique that is most common today.

Analysis of antimicrobial activity: cluster analysis
    can be used to analyse patterns of antibiotic
    resistance, to classify antimicrobial compounds
    according to their mechanism of action, to
    classify antibiotics according to their
    antibacterial activity.

IMRT segmentation: clustering can be used to divide
    a fluence map into distinct regions for
    conversion into deliverable fields in MLC-based
    Radiation Therapy.

Business and marketing

Market research

World wide web

Social network analysis

Computer science

Software evolution

Social science

Crime analysis

Others

41.5 See also

41.5.1 Specialized types of cluster analysis

• Clustering high-dimensional data

• Conceptual clustering 

• Consensus clustering 

• Constrained clustering 

• Data stream clustering 

• Sequence clustering 

• Spectral clustering 

• HCS clustering 

41.5.2 Techniques used in cluster analysis 

• Artificial neural network (ANN) 

• Nearest neighbor search 

• Neighbourhood components analysis 

• Latent class analysis 

41.5.3 Data projection and preprocessing 

• Dimension reduction 

• Principal component analysis 

• Multidimensional scaling 

41.5.4 Other 

• Cluster-weighted modeling 

• Curse of dimensionality 

• Determining the number of clusters in a data set 

• Parallel coordinates 

• Structured data analysis

41.7 External links

• Data Mining at DMOZ 


Chapter 42 

Expectation-maximization algorithm 


In statistics, an expectation-maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.

EM clustering of Old Faithful eruption data. The random initial model (which, due to the different scales of the axes, appears to be two very flat and wide spheres) is fit to the observed data. In the first iterations, the model changes substantially, but then converges to the two modes of the geyser. Visualized using ELKI.


42.1 History 

The EM algorithm was explained and given its name in a classic 1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin.[1] They pointed out that the method had been “proposed many times in special circumstances” by earlier authors. In particular, a very detailed treatment of the EM method for exponential families was published by Rolf Sundberg in his thesis and several papers[2][3][4] following his collaboration with Per Martin-Löf and Anders Martin-Löf.[5][6][7][8][9][10][11] The Dempster-Laird-Rubin paper in 1977 generalized the method and sketched a convergence analysis for a wider class of problems. Regardless of earlier inventions, the innovative Dempster-Laird-Rubin paper in the Journal of the Royal Statistical Society received an enthusiastic discussion at the Royal Statistical Society meeting with Sundberg calling the paper “brilliant”. The Dempster-Laird-Rubin paper established the EM method as an important tool of statistical analysis.

The convergence analysis of the Dempster-Laird-Rubin paper was flawed and a correct convergence analysis was published by C.F. Jeff Wu in 1983.[12] Wu’s proof established the EM method’s convergence outside of the exponential family, as claimed by Dempster-Laird-Rubin.[13]


42.2 Introduction 

The EM algorithm is used to find (locally) maximum like¬ 
lihood parameters of a statistical model in cases where 
the equations cannot be solved directly. Typically these 
models involve latent variables in addition to unknown 
parameters and known data observations. That is, either 
there are missing values among the data, or the model 
can be formulated more simply by assuming the exis¬ 
tence of additional unobserved data points. For exam¬ 
ple, a mixture model can be described more simply by 
assuming that each observed data point has a correspond¬ 
ing unobserved data point, or latent variable, specifying 
the mixture component that each data point belongs to. 

Finding a maximum likelihood solution typically requires taking the derivatives of the likelihood function with respect to all the unknown values (viz. the parameters and the latent variables) and simultaneously solving the resulting equations. In statistical models with latent variables, this usually is not possible. Instead, the result is typically a set of interlocking equations in which the solution to the parameters requires the values of the latent variables and vice versa, but substituting one set of equations into the other produces an unsolvable equation.

The EM algorithm proceeds from the observation that the 
following is a way to solve these two sets of equations nu¬ 
merically. One can simply pick arbitrary values for one 
of the two sets of unknowns, use them to estimate the 
second set, then use these new values to find a better es¬ 
timate of the first set, and then keep alternating between 
the two until the resulting values both converge to fixed 
points. It’s not obvious that this will work at all, but in fact 
it can be proven that in this particular context it does, and 
that the derivative of the likelihood is (arbitrarily close 
to) zero at that point, which in turn means that the point 
is either a maximum or a saddle point. In general there 
may be multiple maxima, and there is no guarantee that 
the global maximum will be found. Some likelihoods also 
have singularities in them, i.e. nonsensical maxima. For 
example, one of the “solutions” that may be found by EM 
in a mixture model involves setting one of the compo¬ 
nents to have zero variance and the mean parameter for 
the same component to be equal to one of the data points. 
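
The zero-variance singularity mentioned above can be seen numerically. The following is a minimal sketch, assuming a one-dimensional mixture of two Gaussians with made-up data and parameter values; the helper name `mixture_log_likelihood` is introduced here purely for illustration. Placing the mean of the first component exactly on one data point and shrinking its variance makes the log-likelihood grow without bound, even though the fit is meaningless.

```python
import numpy as np

def mixture_log_likelihood(x, means, variances, weights):
    """Log-likelihood of 1-D data under a Gaussian mixture (illustrative helper)."""
    x = np.asarray(x, dtype=float)[:, None]
    means = np.asarray(means, dtype=float)[None, :]
    variances = np.asarray(variances, dtype=float)[None, :]
    weights = np.asarray(weights, dtype=float)[None, :]
    dens = np.exp(-0.5 * (x - means) ** 2 / variances) / np.sqrt(2.0 * np.pi * variances)
    return float(np.log((weights * dens).sum(axis=1)).sum())

x = [0.0, 1.0, 2.0, 3.0]  # hypothetical observations
# The first component sits exactly on x[0]; only its variance is varied.
lls = [
    mixture_log_likelihood(x, [0.0, 2.0], [v, 1.0], [0.5, 0.5])
    for v in (1e-2, 1e-4, 1e-8)
]
print(lls)  # each value is larger than the previous one
```

In practice this is usually handled by bounding the variances from below or by restarting from a different initialization; which remedy is appropriate depends on the model.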

42.3 Description 

Given a statistical model which generates a set X of observed data, a set of unobserved latent data or missing values Z, and a vector of unknown parameters $\theta$, along with a likelihood function $L(\theta; X, Z) = p(X, Z \mid \theta)$, the maximum likelihood estimate (MLE) of the unknown parameters is determined by the marginal likelihood of the observed data

$$L(\theta; X) = p(X \mid \theta) = \sum_{Z} p(X, Z \mid \theta)$$

However, this quantity is often intractable (e.g. if Z is a 
sequence of events, so that the number of values grows 
exponentially with the sequence length, making the exact 
calculation of the sum extremely difficult). 

The EM algorithm seeks to find the MLE of the marginal 
likelihood by iteratively applying the following two steps: 

Expectation step (E step): Calculate the expected value of the log likelihood function, with respect to the conditional distribution of Z given X under the current estimate of the parameters $\theta^{(t)}$:

$$Q(\theta \mid \theta^{(t)}) = \operatorname{E}_{Z \mid X, \theta^{(t)}}\left[\log L(\theta; X, Z)\right]$$

Maximization step (M step): Find the parameter that maximizes this quantity:

$$\theta^{(t+1)} = \underset{\theta}{\operatorname{arg\,max}}\; Q(\theta \mid \theta^{(t)})$$

Note that in typical models to which EM is applied: 

1. The observed data points X may be discrete (tak¬ 
ing values in a finite or countably infinite set) or 
continuous (taking values in an uncountably infinite 
set). There may in fact be a vector of observations 
associated with each data point. 

2. The missing values (aka latent variables) Z are 
discrete, drawn from a fixed number of values, and 
there is one latent variable per observed data point. 

3. The parameters are continuous, and are of two 
kinds: Parameters that are associated with all data 
points, and parameters associated with a particular 
value of a latent variable (i.e. associated with all 
data points whose corresponding latent variable has 
a particular value). 

However, it is possible to apply EM to other sorts of mod¬ 
els. 

The motivation is as follows. If we know the value of the parameters $\theta$, we can usually find the value of the latent variables Z by maximizing the log-likelihood over all possible values of Z, either simply by iterating over Z or through an algorithm such as the Viterbi algorithm for hidden Markov models. Conversely, if we know the value of the latent variables Z, we can find an estimate of the parameters $\theta$ fairly easily, typically by simply grouping the observed data points according to the value of the associated latent variable and averaging the values, or some function of the values, of the points in each group. This suggests an iterative algorithm, in the case where both $\theta$ and Z are unknown:

1. First, initialize the parameters $\theta$ to some random values.

2. Compute the best value for Z given these parameter values.

3. Then, use the just-computed values of Z to compute a better estimate for the parameters $\theta$. Parameters associated with a particular value of Z will use only those data points whose associated latent variable has that value.

4. Iterate steps 2 and 3 until convergence.

The algorithm as just described monotonically approaches a local minimum of the cost function, and is commonly called hard EM. The k-means algorithm is an example of this class of algorithms.

However, one can do somewhat better: Rather than making a hard choice for Z given the current parameter values and averaging only over the set of data points associated with a particular value of Z, one can instead determine the probability of each possible value of Z for each data point, and then use the probabilities associated with a particular value of Z to compute a weighted average over the entire set of data points. The resulting algorithm is commonly called soft EM, and is the type of algorithm normally associated with EM. The counts used to compute these weighted averages are called soft counts (as opposed to the hard counts used in a hard-EM-type algorithm such as k-means). The probabilities computed for Z are posterior probabilities and are what is computed in the E step. The soft counts used to compute new parameter values are what is computed in the M step.
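
To make the role of soft counts concrete, here is a minimal sketch of soft EM, assuming one-dimensional data and a mixture of two Gaussians. The function names, the synthetic data, the starting values and the small variance floor are all assumptions made for this illustration and do not come from the text above.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def soft_em_1d(x, means, variances, weights, n_iter=50):
    """Soft EM for a 1-D Gaussian mixture; also returns the log-likelihood trace."""
    x = np.asarray(x, dtype=float)
    means = np.asarray(means, dtype=float).copy()
    variances = np.asarray(variances, dtype=float).copy()
    weights = np.asarray(weights, dtype=float).copy()
    trace = []
    for _ in range(n_iter):
        # E step: posterior probability of every component for every point.
        joint = weights[None, :] * normal_pdf(x[:, None], means[None, :], variances[None, :])
        trace.append(float(np.log(joint.sum(axis=1)).sum()))
        post = joint / joint.sum(axis=1, keepdims=True)
        # M step: weighted averages, with soft counts in place of hard counts.
        soft_counts = post.sum(axis=0)
        weights = soft_counts / len(x)
        means = (post * x[:, None]).sum(axis=0) / soft_counts
        variances = (post * (x[:, None] - means[None, :]) ** 2).sum(axis=0) / soft_counts
        variances = np.maximum(variances, 1e-6)  # assumed floor against zero variance
    return means, variances, weights, trace

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(6.0, 1.0, 200)])
means, variances, weights, trace = soft_em_1d(data, [1.0, 4.0], [1.0, 1.0], [0.5, 0.5])
print(np.sort(means), weights)
```

Every data point contributes to every component's update, weighted by its posterior probability; replacing `post` with a one-hot matrix of the most probable component would turn this into the hard variant.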


42.4 Properties

Speaking of an expectation (E) step is a bit of a misnomer. What is calculated in the first step are the fixed, data-dependent parameters of the function Q. Once the parameters of Q are known, it is fully determined and is maximized in the second (M) step of an EM algorithm.

Although an EM iteration does increase the observed data (i.e. marginal) likelihood function, there is no guarantee that the sequence converges to a maximum likelihood estimator. For multimodal distributions, this means that an EM algorithm may converge to a local maximum of the observed data likelihood function, depending on starting values. There are a variety of heuristic or metaheuristic approaches for escaping a local maximum, such as random restart (starting with several different random initial estimates $\theta^{(t)}$), or applying simulated annealing methods.

EM is particularly useful when the likelihood is an exponential family: the E step becomes the sum of expectations of sufficient statistics, and the M step involves maximizing a linear function. In such a case, it is usually possible to derive closed form updates for each step, using the Sundberg formula (published by Rolf Sundberg using unpublished results of Per Martin-Löf and Anders Martin-Löf).[3][4][7][8][9][10][11]

The EM method was modified to compute maximum a posteriori (MAP) estimates for Bayesian inference in the original paper by Dempster, Laird, and Rubin.

There are other methods for finding maximum likelihood estimates, such as gradient descent, conjugate gradient or variations of the Gauss-Newton method. Unlike EM, such methods typically require the evaluation of first and/or second derivatives of the likelihood function.

42.5 Proof of correctness

Expectation-maximization works to improve $Q(\theta \mid \theta^{(t)})$ rather than directly improving $\log p(X \mid \theta)$. Here we show that improvements to the former imply improvements to the latter.[14]

For any Z with non-zero probability $p(Z \mid X, \theta)$, we can write

$$\log p(X \mid \theta) = \log p(X, Z \mid \theta) - \log p(Z \mid X, \theta).$$

We take the expectation over values of Z by multiplying both sides by $p(Z \mid X, \theta^{(t)})$ and summing (or integrating) over Z. The left-hand side is the expectation of a constant, so we get:

$$\log p(X \mid \theta) = \sum_{Z} p(Z \mid X, \theta^{(t)}) \log p(X, Z \mid \theta) - \sum_{Z} p(Z \mid X, \theta^{(t)}) \log p(Z \mid X, \theta)$$

$$= Q(\theta \mid \theta^{(t)}) + H(\theta \mid \theta^{(t)}),$$

where $H(\theta \mid \theta^{(t)})$ is defined by the negated sum it is replacing. This last equation holds for any value of $\theta$ including $\theta = \theta^{(t)}$,

$$\log p(X \mid \theta^{(t)}) = Q(\theta^{(t)} \mid \theta^{(t)}) + H(\theta^{(t)} \mid \theta^{(t)}),$$

and subtracting this last equation from the previous equation gives

$$\log p(X \mid \theta) - \log p(X \mid \theta^{(t)}) = Q(\theta \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) + H(\theta \mid \theta^{(t)}) - H(\theta^{(t)} \mid \theta^{(t)}).$$

However, Gibbs’ inequality tells us that $H(\theta \mid \theta^{(t)}) \ge H(\theta^{(t)} \mid \theta^{(t)})$, so we can conclude that

$$\log p(X \mid \theta) - \log p(X \mid \theta^{(t)}) \ge Q(\theta \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}).$$

In words, choosing $\theta$ to improve $Q(\theta \mid \theta^{(t)})$ beyond $Q(\theta^{(t)} \mid \theta^{(t)})$ will improve $\log p(X \mid \theta)$ beyond $\log p(X \mid \theta^{(t)})$ at least as much.

42.6 Alternative description

Under some circumstances, it is convenient to view the EM algorithm as two alternating maximization steps.[13][16] Consider the function:

$$F(q, \theta) = \operatorname{E}_q[\log L(\theta; x, Z)] + H(q) = -D_{\mathrm{KL}}\big(q \,\big\|\, p_{Z \mid X}(\cdot \mid x; \theta)\big) + \log L(\theta; x)$$

where q is an arbitrary probability distribution over the unobserved data z, $p_{Z \mid X}(\cdot \mid x; \theta)$ is the conditional distribution of the unobserved data given the observed data x, H is the entropy and $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence.

Then the steps in the EM algorithm may be viewed as:

Expectation step: Choose q to maximize F:

$$q^{(t)} = \underset{q}{\operatorname{arg\,max}}\; F(q, \theta^{(t)})$$

Maximization step: Choose $\theta$ to maximize F:

$$\theta^{(t+1)} = \underset{\theta}{\operatorname{arg\,max}}\; F(q^{(t)}, \theta)$$
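
The claim that the E step maximizes F over q can be checked on a toy case. The sketch below assumes a single observation x and a latent variable Z with two possible values; the joint probabilities are invented numbers and `free_energy` is a name chosen for this example. For fixed parameters, F equals the log-likelihood when q is the posterior of Z, and is smaller for any other q.

```python
import numpy as np

def free_energy(q, joint):
    """F(q, theta) = E_q[log p(x, Z | theta)] + H(q) for a discrete latent Z.

    joint[z] holds p(x, Z=z | theta) for the fixed observation x.
    """
    q = np.asarray(q, dtype=float)
    joint = np.asarray(joint, dtype=float)
    nz = q > 0  # treat 0 * log 0 as 0
    expected_log_joint = np.sum(q[nz] * np.log(joint[nz]))
    entropy = -np.sum(q[nz] * np.log(q[nz]))
    return float(expected_log_joint + entropy)

joint = np.array([0.10, 0.30])   # hypothetical p(x, Z=z | theta) for z in {0, 1}
log_lik = float(np.log(joint.sum()))   # log L(theta; x)
posterior = joint / joint.sum()        # p(Z | x; theta), the E-step choice of q

f_post = free_energy(posterior, joint)
f_other = free_energy([0.5, 0.5], joint)
print(f_post, f_other, log_lik)
```

The gap `log_lik - f_other` is the Kullback-Leibler divergence between the uniform q and the posterior, which is why it cannot be negative.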

42.7 Applications 

EM is frequently used for data clustering in machine 
learning and computer vision. In natural language pro¬ 
cessing, two prominent instances of the algorithm are the 
Baum-Welch algorithm and the inside-outside algorithm 
for unsupervised induction of probabilistic context-free 
grammars. 

In psychometrics, EM is almost indispensable for estimat¬ 
ing item parameters and latent abilities of item response 
theory models. 

With the ability to deal with missing data and observe 
unidentified variables, EM is becoming a useful tool to 
price and manage risk of a portfolio.[ref?] 

The EM algorithm (and its faster variant Ordered subset 
expectation maximization) is also widely used in medical 
image reconstruction, especially in positron emission to¬ 
mography and single photon emission computed tomog¬ 
raphy. See below for other faster variants of EM. 

42.8 Filtering and smoothing EM 
algorithms 

A Kalman filter is typically used for on-line state esti¬ 
mation and a minimum-variance smoother may be em¬ 
ployed for off-line or batch state estimation. However, 
these minimum-variance solutions require estimates of 
the state-space model parameters. EM algorithms can 
be used for solving joint state and parameter estimation 
problems. 

Filtering and smoothing EM algorithms arise by repeating 
the following two-step procedure: 

E-step Operate a Kalman filter or a minimum-variance 
smoother designed with current parameter estimates 
to obtain updated state estimates. 

M-step Use the filtered or smoothed state estimates 
within maximum-likelihood calculations to obtain 
updated parameter estimates. 

Suppose that a Kalman filter or minimum-variance smoother operates on noisy measurements of a single-input-single-output system. An updated measurement noise variance estimate can be obtained from the maximum likelihood calculation

$$\widehat{\sigma}_v^2 = \frac{1}{N} \sum_{k=1}^{N} (z_k - \widehat{x}_k)^2$$

where $\widehat{x}_k$ are scalar output estimates calculated by a filter or a smoother from N scalar measurements $z_k$. Similarly, for a first-order auto-regressive process, an updated process noise variance estimate can be calculated by

$$\widehat{\sigma}_w^2 = \frac{1}{N} \sum_{k=1}^{N} (\widehat{x}_{k+1} - \widehat{F} \widehat{x}_k)^2$$

where $\widehat{x}_k$ and $\widehat{x}_{k+1}$ are scalar state estimates calculated by a filter or a smoother. The updated model coefficient estimate is obtained via

$$\widehat{F} = \frac{\sum_{k=1}^{N} (\widehat{x}_{k+1} - \widehat{F} \widehat{x}_k)}{\sum_{k=1}^{N} \widehat{x}_k^2}$$

The convergence of parameter estimates such as those above is well studied.[17][18][19]
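
The two variance updates are plain averages of squared residuals, which the following sketch computes. It assumes the filtered or smoothed estimates are already available as arrays; the function names and the sample numbers are hypothetical, and for the process noise the array of state estimates is assumed to hold N + 1 values so that N consecutive pairs exist.

```python
import numpy as np

def measurement_noise_variance(z, x_hat):
    """(1/N) * sum_k (z_k - x_hat_k)^2 for N measurements and output estimates."""
    z = np.asarray(z, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return float(np.mean((z - x_hat) ** 2))

def process_noise_variance(x_hat, f_hat):
    """(1/N) * sum_k (x_hat_{k+1} - f_hat * x_hat_k)^2 over the N consecutive pairs."""
    x_hat = np.asarray(x_hat, dtype=float)
    resid = x_hat[1:] - f_hat * x_hat[:-1]
    return float(np.mean(resid ** 2))

# Hypothetical measurements and state estimates.
sigma_v2 = measurement_noise_variance([1.0, 2.0, 3.0], [1.5, 2.0, 2.5])
sigma_w2 = process_noise_variance([1.0, 0.5, 0.25], f_hat=0.5)
print(sigma_v2, sigma_w2)
```

In a full filtering or smoothing EM loop these values would be fed back into the filter design for the next E-step; that loop is not shown here.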


42.9 Variants 

A number of methods have been proposed to accelerate the sometimes slow convergence of the EM algorithm, such as those using conjugate gradient and modified Newton-Raphson techniques.[20] Additionally, EM can be used with constrained estimation techniques.

Expectation conditional maximization (ECM) replaces each M step with a sequence of conditional maximization (CM) steps in which each parameter $\theta_i$ is maximized individually, conditionally on the other parameters remaining fixed.[21]

This idea is further extended in the generalized expectation maximization (GEM) algorithm, in which one only seeks an increase in the objective function F for both the E step and M step under the alternative description.[13] GEM is further developed in a distributed environment and shows promising results.[22]

It is also possible to consider the EM algorithm as a subclass of the MM (Majorize/Minimize or Minorize/Maximize, depending on context) algorithm,[23] and therefore use any machinery developed in the more general case.

42.9.1 α-EM algorithm

The Q-function used in the EM algorithm is based on the log likelihood. Therefore, it is regarded as the log-EM algorithm. The use of the log likelihood can be generalized to that of the α-log likelihood ratio. Then, the α-log likelihood ratio of the observed data can be exactly expressed as equality by using the Q-function of the α-log likelihood ratio and the α-divergence. Obtaining this Q-function is a generalized E step. Its maximization is a generalized M step. This pair is called the α-EM algorithm,[24] which contains the log-EM algorithm as its subclass. Thus, the α-EM algorithm by Yasuo Matsuyama is an exact generalization of the log-EM algorithm. No computation of gradient or Hessian matrix is needed. The α-EM shows faster convergence than the log-EM algorithm by choosing an appropriate α. The α-EM algorithm leads to a faster version of the Hidden Markov model estimation algorithm α-HMM.[25]

42.10 Relation to variational Bayes 
methods 

EM is a partially non-Bayesian, maximum likelihood method. Its final result gives a probability distribution over the latent variables (in the Bayesian style) together with a point estimate for $\theta$ (either a maximum likelihood estimate or a posterior mode). We may want a fully Bayesian version of this, giving a probability distribution over $\theta$ as well as the latent variables. In fact the Bayesian approach to inference is simply to treat $\theta$ as another latent variable. In this paradigm, the distinction between the E and M steps disappears. If we use the factorized Q approximation as described above (variational Bayes), we may iterate over each latent variable (now including $\theta$) and optimize them one at a time. There are now k steps per iteration, where k is the number of latent variables. For graphical models this is easy to do as each variable’s new Q depends only on its Markov blanket, so local message passing can be used for efficient inference.

42.11 Geometric interpretation 

For more details on this topic, see Information geometry. 

In information geometry, the E step and the M step 
are interpreted as projections under dual affine connec¬ 
tions, called the e-connection and the m-connection; the 
Kullback-Leibler divergence can also be understood in 
these terms. 


42.12 Examples 

42.12.1 Gaussian mixture 

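An animation demonstrating the EM algorithm fitting a two component Gaussian mixture model to the Old Faithful dataset. The algorithm steps through from a random initialization to convergence. (Plot title: Waiting time vs Eruption time, Old Faithful geyser.)

Let $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ be a sample of n independent observations from a mixture of two multivariate normal distributions of dimension d, and let $\mathbf{z} = (z_1, z_2, \ldots, z_n)$ be the latent variables that determine the component from which the observation originates.[16]

$$X_i \mid (Z_i = 1) \sim \mathcal{N}_d(\mu_1, \Sigma_1) \quad\text{and}\quad X_i \mid (Z_i = 2) \sim \mathcal{N}_d(\mu_2, \Sigma_2)$$

where

$$\operatorname{P}(Z_i = 1) = \tau_1 \quad\text{and}\quad \operatorname{P}(Z_i = 2) = \tau_2 = 1 - \tau_1$$

The aim is to estimate the unknown parameters representing the “mixing” value between the Gaussians and the means and covariances of each:

$$\theta = (\tau, \mu_1, \mu_2, \Sigma_1, \Sigma_2)$$

where the incomplete-data likelihood function is

$$L(\theta; \mathbf{x}) = \prod_{i=1}^{n} \sum_{j=1}^{2} \tau_j \, f(x_i; \mu_j, \Sigma_j)$$

and the complete-data likelihood function is

$$L(\theta; \mathbf{x}, \mathbf{z}) = P(\mathbf{x}, \mathbf{z} \mid \theta) = \prod_{i=1}^{n} \sum_{j=1}^{2} \mathbb{I}(z_i = j) \, f(x_i; \mu_j, \Sigma_j) \, \tau_j$$

or

$$L(\theta; \mathbf{x}, \mathbf{z}) = \exp\left\{ \sum_{i=1}^{n} \sum_{j=1}^{2} \mathbb{I}(z_i = j) \left[ \log \tau_j - \tfrac{1}{2} \log |\Sigma_j| - \tfrac{1}{2} (x_i - \mu_j)^{\top} \Sigma_j^{-1} (x_i - \mu_j) - \tfrac{d}{2} \log(2\pi) \right] \right\}$$

where $\mathbb{I}$ is an indicator function and f is the probability density function of a multivariate normal.

To see the last equality, note that for each i all indicators $\mathbb{I}(z_i = j)$ are equal to zero, except for one which is equal to one. The inner sum thus reduces to a single term.

E step

Given our current estimate of the parameters $\theta^{(t)}$, the conditional distribution of the $Z_i$ is determined by Bayes theorem to be the proportional height of the normal density weighted by $\tau$:

$$T_{j,i}^{(t)} := \operatorname{P}(Z_i = j \mid X_i = x_i; \theta^{(t)}) = \frac{\tau_j^{(t)} \, f(x_i; \mu_j^{(t)}, \Sigma_j^{(t)})}{\tau_1^{(t)} \, f(x_i; \mu_1^{(t)}, \Sigma_1^{(t)}) + \tau_2^{(t)} \, f(x_i; \mu_2^{(t)}, \Sigma_2^{(t)})}$$

These are called the “membership probabilities” which are normally considered the output of the E step (although this is not the Q function of below).

Note that this E step corresponds with the following function for Q:

$$Q(\theta \mid \theta^{(t)}) = \operatorname{E}[\log L(\theta; \mathbf{x}, \mathbf{Z})]$$

$$= \operatorname{E}\Big[\log \prod_{i=1}^{n} L(\theta; x_i, z_i)\Big]$$

$$= \operatorname{E}\Big[\sum_{i=1}^{n} \log L(\theta; x_i, z_i)\Big]$$

$$= \sum_{i=1}^{n} \operatorname{E}[\log L(\theta; x_i, z_i)]$$

$$= \sum_{i=1}^{n} \sum_{j=1}^{2} T_{j,i}^{(t)} \left[ \log \tau_j - \tfrac{1}{2} \log |\Sigma_j| - \tfrac{1}{2} (x_i - \mu_j)^{\top} \Sigma_j^{-1} (x_i - \mu_j) - \tfrac{d}{2} \log(2\pi) \right]$$

This does not need to be calculated, because in the M step we only require the terms depending on $\tau$ when we maximize for $\tau$, or only the terms depending on $\mu$ if we maximize for $\mu$.

M step

The fact that $Q(\theta \mid \theta^{(t)})$ is quadratic in form means that determining the maximizing values of $\theta$ is relatively straightforward. Note that $\tau$, $(\mu_1, \Sigma_1)$ and $(\mu_2, \Sigma_2)$ may all be maximized independently since they all appear in separate linear terms.

To begin, consider $\tau$, which has the constraint $\tau_1 + \tau_2 = 1$:

$$\tau^{(t+1)} = \underset{\tau}{\operatorname{arg\,max}}\; Q(\theta \mid \theta^{(t)})$$

$$= \underset{\tau}{\operatorname{arg\,max}} \left\{ \left[ \sum_{i=1}^{n} T_{1,i}^{(t)} \right] \log \tau_1 + \left[ \sum_{i=1}^{n} T_{2,i}^{(t)} \right] \log \tau_2 \right\}$$

This has the same form as the MLE for the binomial distribution, so

$$\tau_j^{(t+1)} = \frac{\sum_{i=1}^{n} T_{j,i}^{(t)}}{\sum_{i=1}^{n} \left( T_{1,i}^{(t)} + T_{2,i}^{(t)} \right)} = \frac{1}{n} \sum_{i=1}^{n} T_{j,i}^{(t)}$$

For the next estimates of $(\mu_1, \Sigma_1)$:

$$(\mu_1^{(t+1)}, \Sigma_1^{(t+1)}) = \underset{\mu_1, \Sigma_1}{\operatorname{arg\,max}}\; Q(\theta \mid \theta^{(t)})$$

$$= \underset{\mu_1, \Sigma_1}{\operatorname{arg\,max}} \sum_{i=1}^{n} T_{1,i}^{(t)} \left\{ -\tfrac{1}{2} \log |\Sigma_1| - \tfrac{1}{2} (x_i - \mu_1)^{\top} \Sigma_1^{-1} (x_i - \mu_1) \right\}$$

This has the same form as a weighted MLE for a normal distribution, so

$$\mu_1^{(t+1)} = \frac{\sum_{i=1}^{n} T_{1,i}^{(t)} x_i}{\sum_{i=1}^{n} T_{1,i}^{(t)}} \quad\text{and}\quad \Sigma_1^{(t+1)} = \frac{\sum_{i=1}^{n} T_{1,i}^{(t)} (x_i - \mu_1^{(t+1)})(x_i - \mu_1^{(t+1)})^{\top}}{\sum_{i=1}^{n} T_{1,i}^{(t)}}$$

and, by symmetry

$$\mu_2^{(t+1)} = \frac{\sum_{i=1}^{n} T_{2,i}^{(t)} x_i}{\sum_{i=1}^{n} T_{2,i}^{(t)}} \quad\text{and}\quad \Sigma_2^{(t+1)} = \frac{\sum_{i=1}^{n} T_{2,i}^{(t)} (x_i - \mu_2^{(t+1)})(x_i - \mu_2^{(t+1)})^{\top}}{\sum_{i=1}^{n} T_{2,i}^{(t)}}$$

Termination

Conclude the iterative process if $\log L(\theta^{(t)}; \mathbf{x}, \mathbf{Z}) \le \log L(\theta^{(t-1)}; \mathbf{x}, \mathbf{Z}) + \epsilon$ for $\epsilon$ below some preset threshold.

Generalization

The algorithm illustrated above can be generalized for mixtures of more than two multivariate normal distributions.

42.12.2 Truncated and censored regression

The EM algorithm has been implemented in the case where there is an underlying linear regression model explaining the variation of some quantity, but where the values actually observed are censored or truncated versions of those represented in the model.[26] Special cases of this model include censored or truncated observations from a single normal distribution.[26]

42.13 Alternatives to EM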

EM typically converges to a local optimum—not neces¬ 
sarily the global optimum—and there is no bound on the 
convergence rate in general. It is possible that it can be 
arbitrarily poor in high dimensions and there can be an ex¬ 
ponential number of local optima. Hence, there is a need 
for alternative techniques for guaranteed learning, espe¬ 
cially in the high-dimensional setting. There are alterna¬ 
tives to EM with better guarantees in terms of consistency 
which are known as moment-based approaches or the so- 
called “spectral techniques”. Moment-based approaches 
to learning the parameters of a probabilistic model are 
of increasing interest recently since they enjoy guaran¬ 
tees such as global convergence under certain conditions 
unlike EM which is often plagued by the issue of get¬ 
ting stuck in local optima. Algorithms with guarantees 
for learning can be derived for a number of important 
models such as mixture models, HMMs etc. For these 
spectral methods, there are no spurious local optima and 
the true parameters can be consistently estimated under 
some regularity conditions. 


42.14 See also 

• Density estimation 

• Total absorption spectroscopy 

• The EM algorithm can be viewed as a special case of the majorize-minimization (MM) algorithm.[27]


42.15 Further reading 

• Robert Hogg, Joseph McKean and Allen Craig. In¬ 
troduction to Mathematical Statistics, pp. 359-364. 
Upper Saddle River, NJ: Pearson Prentice Hall, 
2005. 

• The on-line textbook: Information Theory, In¬ 
ference, and Learning Algorithms, by David J.C. 
MacKay includes simple examples of the EM algo¬ 
rithm such as clustering using the soft k-means al¬ 
gorithm, and emphasizes the variational view of the 
EM algorithm, as described in Chapter 33.7 of ver¬ 
sion 7.2 (fourth edition). 

• Dellaert, Frank. “The Expectation Maximization 
Algorithm”. CiteSeerX: 10.1.1.9.9735, gives an 
easier explanation of EM algorithm in terms of 
lowerbound maximization. 

• Bishop, Christopher M. (2006). Pattern Recogni¬ 
tion and Machine Learning. Springer. ISBN 0-387- 
31073-8. 


• M. R. Gupta and Y. Chen (2010). Theory and Use 
of the EM Algorithm. doi:10.1561/2000000034. A 
well-written short book on EM, including detailed 
derivation of EM for GMMs, HMMs, and Dirichlet. 

• Bilmes, Jeff. “A Gentle Tutorial of the EM Al¬ 
gorithm and its Application to Parameter Estima¬ 
tion for Gaussian Mixture and Hidden Markov Mod¬ 
els”. CiteSeerX: 10.1.1.28.613, includes a sim¬ 
plified derivation of the EM equations for Gaus¬ 
sian Mixtures and Gaussian Mixture Hidden Markov 
Models. 

• Variational Algorithms for Approximate Bayesian 
Inference, by M. J. Beal includes comparisons of 
EM to Variational Bayesian EM and derivations 
of several models including Variational Bayesian 
HMMs (chapters). 

• The Expectation Maximization Algorithm: A short 
tutorial, A self-contained derivation of the EM Al¬ 
gorithm by Sean Borman. 

• The EM Algorithm, by Xiaojin Zhu. 

• EM algorithm and variants: an informal tutorial by 
Alexis Roche. A concise and very clear description 
of EM and many interesting variants. 

• Einicke, G.A. (2012). Smoothing, Filtering and Prediction: Estimating the Past, Present and Future. Rijeka, Croatia: Intech. ISBN 978-953-307-752-9.

42.17 External links

• Various 1D, 2D and 3D demonstrations of EM together with Mixture Modeling are provided as part of the paired SOCR activities and applets. These applets and activities show empirically the properties of the EM algorithm for parameter estimation in diverse settings.

• k-MLE: A fast algorithm for learning statistical mix¬ 
ture models 

• Class hierarchy in C++ (GPL) including Gaussian 
Mixtures 

• Fast and clean C implementation of the Expectation 
Maximization (EM) algorithm for estimating 
Gaussian Mixture Models (GMMs). 


Chapter 43 


k-means clustering 


k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.

The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both algorithms. Additionally, they both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.

The algorithm has nothing to do with and should not 
be confused with k-nearest neighbor, another popular 
machine learning technique. 

43.1 Description 

Given a set of observations $(x_1, x_2, \ldots, x_n)$, where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets $S = \{S_1, S_2, \ldots, S_k\}$ so as to minimize the within-cluster sum of squares (WCSS). In other words, its objective is to find:

$$\underset{S}{\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{x \in S_i} \|x - \mu_i\|^2$$

where $\mu_i$ is the mean of points in $S_i$.
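
The objective can be evaluated directly for any candidate partition. The following is a small sketch, assuming the partition is given as an integer label per observation; the function name `wcss`, the label encoding and the four sample points are choices made for this illustration.

```python
import numpy as np

def wcss(points, labels, k):
    """Within-cluster sum of squares for a partition given as integer labels 0..k-1."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for i in range(k):
        members = points[labels == i]
        if len(members) == 0:
            continue  # assumption: an empty set contributes nothing
        mu = members.mean(axis=0)  # mean of the points in S_i
        total += float(np.sum((members - mu) ** 2))
    return total

pts = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
good = wcss(pts, [0, 0, 1, 1], k=2)  # groups the nearby points together
bad = wcss(pts, [0, 1, 0, 1], k=2)   # pairs points across the gap
print(good, bad)
```

Comparing the two values shows why the partition that groups nearby points is preferred by the objective.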


43.2 History 

The term “k-means” was first used by James MacQueen in 1967,[1] though the idea goes back to Hugo Steinhaus in 1957.[2] The standard algorithm was first proposed by Stuart Lloyd in 1957 as a technique for pulse-code modulation, though it wasn't published outside of Bell Labs until 1982.[3] In 1965, E. W. Forgy published essentially the same method, which is why it is sometimes referred to as Lloyd-Forgy.[4] A more efficient version was proposed and published in Fortran by Hartigan and Wong in 1975/1979.[5][6]

43.3 Algorithms 

43.3.1 Standard algorithm 

The most common algorithm uses an iterative refinement 
technique. Due to its ubiquity it is often called the k- 
means algorithm; it is also referred to as Lloyd’s algo¬ 
rithm, particularly in the computer science community. 

Given an initial set of k means (see be¬ 

low), the algorithm proceeds by alternating between two 
steps: 171 

Assignment step: Assign each observation to 
the cluster whose mean yields the least within- 
cluster sum of squares (WCSS). Since the sum 
of squares is the squared Euclidean distance, 
this is intuitively the “nearest” mean. 181 (Math¬ 
ematically, this means partitioning the obser¬ 
vations according to the Voronoi diagram gen¬ 
erated by the means). 

= { x p : ||x p — < 

j | x p rri'j ’ll” Vj, i < j < k}, 

where each x p is assigned to exactly 
one s w , even if it could be as¬ 
signed to two or more of them. 

Update step: Calculate the new means to be 
the centroids of the observations in the new 
clusters. 

0+1) 1 v—v 

m i =j^E xj 6 S w*i 

Since the arithmetic mean is a least-
squares estimator, this also
minimizes the within-cluster sum of
squares (WCSS) objective.

The algorithm has converged when the assignments no
longer change. Since both steps optimize the WCSS ob¬ 
jective, and there only exists a finite number of such par¬ 
titionings, the algorithm must converge to a (local) opti¬ 
mum. There is no guarantee that the global optimum is 
found using this algorithm. 
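The two alternating steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the `max_iter` safety cap, the tie-breaking rule (lowest index wins) and the choice to keep the old mean when a cluster becomes empty are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, initial_means, max_iter=100):
    """Alternate assignment and update steps until assignments stop changing."""
    means = [tuple(m) for m in initial_means]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest mean for every observation
        # (ties go to the lowest index, so each point gets exactly one cluster).
        new_assignment = [
            min(range(len(means)), key=lambda i: squared_distance(p, means[i]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: the assignments no longer change
        assignment = new_assignment
        # Update step: each mean becomes the centroid of its cluster.
        for i in range(len(means)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # assumption: an empty cluster keeps its old mean
                means[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, assignment


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
# Deliberately poor start: both initial means lie in the same group.
means, assignment = lloyd(points, [points[0], points[1]])
print(assignment)  # [0, 0, 0, 1, 1, 1]
```

On this toy data the procedure recovers the two visible groups after two updates; with other starting means or other data it may stop at a different local optimum.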

The algorithm is often presented as assigning objects to 
the nearest cluster by distance. The standard algorithm 
aims at minimizing the WCSS objective, and thus assigns 
by “least sum of squares”, which is exactly equivalent to 
assigning by the smallest Euclidean distance. Using a
distance function other than (squared) Euclidean
distance may stop the algorithm from converging.
Various modifications of k-means such as spherical k-means
and k-medoids have been proposed to allow using other
distance measures.

Initialization methods 

Commonly used initialization methods are Forgy and
Random Partition. [9] The Forgy method randomly
chooses k observations from the data set and uses these
as the initial means. The Random Partition method first
randomly assigns a cluster to each observation and then
proceeds to the update step, thus computing the initial
mean to be the centroid of the cluster’s randomly
assigned points. The Forgy method tends to spread the
initial means out, while Random Partition places all of
them close to the center of the data set. According to
Hamerly et al., [9] the Random Partition method is
generally preferable for algorithms such as the k-harmonic
means and fuzzy k-means. For expectation maximization
and standard k-means algorithms, the Forgy method
of initialization is preferable.
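Both initialization methods are short enough to sketch directly. The function names and the explicit `rng` argument are assumptions made for this example, as is the fallback used when Random Partition happens to leave a cluster empty.

```python
import random


def forgy_init(points, k, rng):
    """Forgy: use k distinct, randomly chosen observations as initial means."""
    return rng.sample(points, k)


def random_partition_init(points, k, rng):
    """Random Partition: label every observation at random, then take centroids."""
    labels = [rng.randrange(k) for _ in points]
    means = []
    for i in range(k):
        members = [p for p, lab in zip(points, labels) if lab == i]
        if not members:
            # Assumption: an empty random cluster falls back to one observation.
            members = [rng.choice(points)]
        means.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return means


rng = random.Random(0)
data = [(float(i), float(i % 7)) for i in range(200)]
forgy_means = forgy_init(data, 3, rng)
partition_means = random_partition_init(data, 3, rng)
```

Forgy means are actual observations and can land anywhere in the data, whereas each Random Partition mean averages roughly n/k randomly chosen points and therefore tends to sit near the overall centroid.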

• Demonstration of the standard algorithm 

• 1 . k initial “means” (in this case k= 3) are randomly 
generated within the data domain (shown in color). 

• 2. k clusters are created by associating every ob¬ 
servation with the nearest mean. The partitions 
here represent the Voronoi diagram generated by the 
means. 

• 3. The centroid of each of the k clusters becomes 
the new mean. 

• 4. Steps 2 and 3 are repeated until convergence has 
been reached. 

As it is a heuristic algorithm, there is no guarantee that it
will converge to the global optimum, and the result may
depend on the initial clusters. As the algorithm is
usually very fast, it is common to run it multiple times with
different starting conditions. However, in the worst case,
k-means can be very slow to converge: in particular it has
been shown that there exist certain point sets, even in 2
dimensions, on which k-means takes exponential time, that
is $2^{\Omega(n)}$, to converge. [10] These point sets do not seem to
arise in practice: this is corroborated by the fact that the
smoothed running time of k-means is polynomial. [11]

The “assignment” step is also referred to as expectation 
step, the “update step” as maximization step, making 
this algorithm a variant of the generalized expectation- 
maximization algorithm. 

43.3.2 Complexity 

Regarding computational complexity, finding the optimal
solution to the k-means clustering problem for
observations in d dimensions is:

• NP-hard in general Euclidean space d even for 2
clusters [12][13]

• NP-hard for a general number of clusters k even in
the plane [14]

• If k and d (the dimension) are fixed, the problem can
be exactly solved in time $O(n^{dk+1} \log n)$, where n
is the number of entities to be clustered [15]

Thus, a variety of heuristic algorithms such as Lloyd's
algorithm given above are generally used.

The running time of Lloyd's algorithm is often given as
$O(nkdi)$, where n is the number of d-dimensional
vectors, k the number of clusters and i the number of
iterations needed until convergence. On data that does have
a clustering structure, the number of iterations until
convergence is often small, and results only improve slightly
after the first dozen iterations. Lloyd's algorithm is
therefore often considered to be of “linear” complexity in
practice.

The following are some recent insights into the complexity
behaviour of this algorithm.

• Lloyd’s k-means algorithm has polynomial
smoothed running time. It is shown [11] that
for an arbitrary set of n points in $[0, 1]^d$, if each point
is independently perturbed by a normal distribution
with mean 0 and variance $\sigma^2$, then the expected
running time of the k-means algorithm is bounded by
$O(n^{34} k^{34} d^8 \log^4(n) / \sigma^6)$, which is a polynomial
in n, k, d and $1/\sigma$.

• Better bounds are proved for simple cases. For
example, [16] showed that the running time of the
k-means algorithm is bounded by $O(dn^4 M^2)$ for n
points in an integer lattice $\{1, \ldots, M\}^d$.

43.3.3 Variations 

• Jenks natural breaks optimization: k-means applied
to univariate data

• k-medians clustering uses the median in each
dimension instead of the mean, and this way
minimizes the $L_1$ norm (Taxicab geometry).

• k-medoids (also: Partitioning Around Medoids,
PAM) uses the medoid instead of the mean, and this
way minimizes the sum of distances for arbitrary
distance functions.

• Fuzzy C-Means Clustering is a soft version of
k-means, where each data point has a fuzzy degree of
belonging to each cluster.

• Gaussian mixture models trained with the
expectation-maximization algorithm (EM algorithm) maintain
probabilistic assignments to clusters, instead of
deterministic assignments, and multivariate Gaussian
distributions instead of means.

• k-means++ chooses initial centers in a way that gives
a provable upper bound on the WCSS objective.


k-means clustering result for the Iris flower data set and actual 
species visualized using ELKI. Cluster means are marked using 
larger, semi-transparent symbols. 


Different cluster analysis results on the "mouse" data set
(panels: original data, k-means clustering, EM clustering).



• The filtering algorithm uses kd-trees to speed up
each k-means step. [17]

• Some methods attempt to speed up each k-means
step using coresets [18] or the triangle inequality. [19]

• Escape local optima by swapping points between
clusters. [6]

• The Spherical k-means clustering algorithm is
suitable for directional data. [20]

• The Minkowski metric weighted k-means deals
with irrelevant features by assigning cluster specific
weights to each feature. [21]
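Of the variations listed, the k-means++ seeding rule is easy to sketch. The list above only states its guarantee; the sampling rule used here (each further center is drawn with probability proportional to its squared distance from the nearest center already chosen) is the commonly described one and is supplied as background, and the function name, `rng` argument and toy data are assumptions. The sketch also assumes the data contain at least k distinct points.

```python
import random


def kmeans_pp_init(points, k, rng):
    """Choose k initial centers with distance-squared weighted sampling."""
    centers = [rng.choice(points)]  # first center: uniformly at random
    while len(centers) < k:
        # Squared distance from every point to its nearest chosen center.
        d2 = [
            min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
            for p in points
        ]
        # Far-away points are proportionally more likely to be picked;
        # points that are already centers have weight zero.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


rng = random.Random(1)
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = kmeans_pp_init(data, 3, rng)
```

The returned centers would then be passed to the standard algorithm as its initial means.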

43.4 Discussion 



A typical example of the k-means convergence to a local mini¬ 
mum. In this example, the result of k-means clustering (the right 
figure) contradicts the obvious cluster structure of the data set. 
The small circles are the data points, the four ray stars are the 
centroids (means). The initial configuration is on the left figure. 
The algorithm converges after five iterations presented on the
figures, from the left to the right. The illustration was prepared with
the Mirkes Java applet. [22]

The two key features of k-means which make it efficient 
are often regarded as its biggest drawbacks: 

• Euclidean distance is used as a metric and variance 
is used as a measure of cluster scatter. 


k-means clustering and EM clustering on an artificial dataset 
(“mouse”). The tendency of k-means to produce equi-sized clus¬ 
ters leads to bad results, while EM benefits from the Gaussian 
distribution present in the data set 

• The number of clusters k is an input parameter: an 
inappropriate choice of k may yield poor results. 
That is why, when performing k-means, it is im¬ 
portant to run diagnostic checks for determining the 
number of clusters in the data set. 

• Convergence to a local minimum may produce
counterintuitive (“wrong”) results (see example in
Fig.).

A key limitation of k-means is its cluster model. The con¬ 
cept is based on spherical clusters that are separable in a 
way so that the mean value converges towards the cluster 
center. The clusters are expected to be of similar size, 
so that the assignment to the nearest cluster center is the 
correct assignment. When for example applying k-means 
with a value of k = 3 onto the well-known Iris flower data 
set, the result often fails to separate the three Iris species 
contained in the data set. With k = 2 , the two visible 
clusters (one containing two species) will be discovered, 
whereas with k = 3 one of the two clusters will be split 
into two even parts. In fact, k = 2 is more appropriate 
for this data set, despite the data set containing 3 classes. 
As with any other clustering algorithm, the k-means re¬ 
sult relies on the data set to satisfy the assumptions made 
by the clustering algorithms. It works well on some data 
sets, while failing on others. 

The result of k-means can also be seen as the Voronoi cells
of the cluster means. Since data is split halfway between
cluster means, this can lead to suboptimal splits as can be
seen in the “mouse” example. The Gaussian models used
by the Expectation-maximization algorithm (which can
be seen as a generalization of k-means) are more flexible
here by having both variances and covariances. The EM
result is thus able to accommodate clusters of variable size
much better than k-means as well as correlated clusters
(not in this example).


43.5 Applications 

k-means clustering, in particular when using heuristics
such as Lloyd’s algorithm, is rather easy to implement
and apply even on large data sets. As such, it has been
successfully used in various fields, including market
segmentation, computer vision, geostatistics, [23] astronomy
and agriculture. It is often used as a preprocessing step
for other algorithms, for example to find a starting
configuration.



Vector quantization of colors present in the image above into
Voronoi cells using k-means (horizontal axis: red level).


43.5.1 Vector quantization 

Main article: Vector quantization 

k-means originates from signal processing, and still finds
use in this domain. For example in computer graphics,
color quantization is the task of reducing the color palette
of an image to a fixed number of colors k. The k-means
algorithm can easily be used for this task and produces
competitive results. Other uses of vector quantization
include non-random sampling, as k-means can easily be
used to choose k different but prototypical objects from
a large data set for further analysis.

Two-channel color image (red and green only, for
illustration purposes).


43.5.2 Cluster analysis 

Main article: Cluster analysis 

In cluster analysis, the k-means algorithm can be used to
partition the input data set into k partitions (clusters).

However, the pure k-means algorithm is not very
flexible, and as such of limited use (except for when vector
quantization as above is actually the desired use case!).
In particular, the parameter k is known to be hard to
choose (as discussed above) when not given by external
constraints. Another limitation of the algorithm is that
it cannot be used with arbitrary distance functions or on
non-numerical data. For these use cases, many other
algorithms have been developed since.


43.5.3 Feature learning 

k-means clustering has been used as a feature learning
(or dictionary learning) step, in either (semi-)supervised
learning or unsupervised learning. [24] The basic approach
is first to train a k-means clustering representation,
using the input training data (which need not be labelled).
Then, to project any input datum into the new feature
space, we have a choice of “encoding” functions, but we
can use for example the thresholded matrix-product of
the datum with the centroid locations, the distance from
the datum to each centroid, or simply an indicator
function for the nearest centroid, [24][25] or some smooth
transformation of the distance. [26] Alternatively, by
transforming the sample-cluster distance through a Gaussian RBF,
one effectively obtains the hidden layer of a radial basis
function network. [27]
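Three of the encoding functions mentioned above can be sketched as follows, assuming the centroids have already been learned. The function names and the `gamma` width parameter of the Gaussian RBF are illustrative assumptions.

```python
import math


def centroid_distances(x, centroids):
    """Distance from the datum x to each centroid."""
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c))) for c in centroids]


def encode_hard(x, centroids):
    """Indicator (one-hot) feature for the nearest centroid."""
    d = centroid_distances(x, centroids)
    nearest = d.index(min(d))
    return [1.0 if i == nearest else 0.0 for i in range(len(centroids))]


def encode_rbf(x, centroids, gamma=1.0):
    """Gaussian RBF of each sample-centroid distance (gamma is hypothetical)."""
    return [math.exp(-gamma * d ** 2) for d in centroid_distances(x, centroids)]


centroids = [(0.0, 0.0), (3.0, 4.0)]
x = (0.0, 0.0)
print(centroid_distances(x, centroids))  # [0.0, 5.0]
print(encode_hard(x, centroids))         # [1.0, 0.0]
```

Each encoding maps a datum to a vector with one entry per centroid, which can then be fed to a downstream classifier.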

This use of k-means has been successfully combined
with simple, linear classifiers for semi-supervised
learning in NLP (specifically for named entity recognition) [28]
and in computer vision. On an object recognition
task, it was found to exhibit comparable performance
with more sophisticated feature learning approaches such
as autoencoders and restricted Boltzmann machines. [26]
However, it generally requires more data than the
sophisticated methods, for equivalent performance, because
each data point only contributes to one “feature” rather
than multiple. [24]

43.6 Relation to other statistical 
machine learning algorithms 

k-means clustering, and its associated expectation-
maximization algorithm, is a special case of a Gaussian
mixture model, specifically, the limit of taking all
covariances as diagonal, equal, and small. It is often easy
to generalize a k-means problem into a Gaussian
mixture model. [29] Another generalization of the k-means
algorithm is the K-SVD algorithm, which estimates data
points as a sparse linear combination of “codebook
vectors”. k-means corresponds to the special case of using
a single codebook vector, with a weight of 1. [30]

43.6.1 Mean shift clustering 

Basic mean shift clustering algorithms maintain a set of
data points the same size as the input data set. Initially,
this set is copied from the input set. Then this set is
iteratively replaced by the mean of those points in the set that
are within a given distance of that point. By contrast,
k-means restricts this updated set to k points, usually much
fewer than the number of points in the input data set, and
replaces each point in this set by the mean of all points
in the input set that are closer to that point than any other
(e.g. within the Voronoi partition of each updating point).
A mean shift algorithm that is similar then to k-means,
called likelihood mean shift, replaces the set of points
undergoing replacement by the mean of all points in the
input set that are within a given distance of the changing
set. [31] One of the advantages of mean shift over k-means
is that there is no need to choose the number of clusters,
because mean shift is likely to find only a few clusters if
indeed only a small number exist. However, mean shift
can be much slower than k-means, and still requires
selection of a bandwidth parameter. Mean shift has soft
variants much as k-means does.
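The basic procedure described at the start of this section can be sketched with a flat (uniform) kernel. The function name, the `radius` bandwidth argument, the `max_iter` cap and the toy data are assumptions for illustration; practical implementations differ in kernel choice and stopping rule.

```python
def mean_shift(points, radius, max_iter=100):
    """Replace every point by the mean of the set members within `radius`."""
    current = [tuple(p) for p in points]  # the set starts as a copy of the input
    for _ in range(max_iter):
        updated = []
        for c in current:
            # Members of the current set within the given distance of c
            # (c itself always qualifies, so the list is never empty).
            neighbours = [
                q for q in current
                if sum((a - b) ** 2 for a, b in zip(c, q)) <= radius ** 2
            ]
            updated.append(
                tuple(sum(col) / len(neighbours) for col in zip(*neighbours))
            )
        if updated == current:
            break  # no point moved
        current = updated
    return current


data = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
modes = mean_shift(data, radius=2.5)
print(sorted(set(modes)))  # [(1.0,), (10.5,)]
```

The number of distinct end points (two here) is not supplied by the user; it follows from the data and the bandwidth, which is the contrast with k-means drawn above.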

43.6.2 Principal component analysis 
(PCA) 

It was asserted in [32][33] that the relaxed solution of
k-means clustering, specified by the cluster indicators, is
given by the PCA (principal component analysis)
principal components, and the PCA subspace spanned by
the principal directions is identical to the cluster
centroid subspace. However, that PCA is a useful
relaxation of k-means clustering was not a new result (see, for
example, [34]), and it is straightforward to uncover
counterexamples to the statement that the cluster centroid
subspace is spanned by the principal directions. [35]

43.6.3 Independent component analysis 
(ICA) 

It has been shown in [36] that under sparsity assumptions,
and when input data is pre-processed with the whitening
transformation, k-means produces the solution to the
linear Independent component analysis task. This aids in
explaining the successful application of k-means to feature
learning.

43.6.4 Bilateral filtering 

k-means implicitly assumes that the ordering of the input
data set does not matter. The bilateral filter is similar to
k-means and mean shift in that it maintains a set of data
points that are iteratively replaced by means. However,
the bilateral filter restricts the calculation of the (kernel
weighted) mean to include only points that are close in the
ordering of the input data. [31] This makes it applicable to
problems such as image denoising, where the spatial
arrangement of pixels in an image is of critical importance.

43.7 Similar problems 

The set of squared error minimizing cluster functions also 
includes the k-medoids algorithm, an approach which 
forces the center point of each cluster to be one of the 
actual points, i.e., it uses medoids in place of centroids. 

43.8 Software Implementations 

43.8.1 Free 

• CrimeStat implements two spatial k-means
algorithms, one of which allows the user to define the
starting locations.

• ELKI contains k-means (with Lloyd and MacQueen
iteration, along with different initializations such
as k-means++ initialization) and various more
advanced clustering algorithms.

• Julia contains a k-means implementation in the
Clustering package. [37]

• Mahout contains a MapReduce based k-means.

• MLPACK contains a C++ implementation of
k-means.

• Octave contains k-means.

• OpenCV contains a k-means implementation.

• R contains three k-means variations. 11

• SciPy and scikit-learn contain multiple k-means
implementations.

• Spark MLlib implements a distributed k-means
algorithm.

• Torch contains an unsup package that provides
k-means clustering.

• Weka contains k-means and x-means.

43.8.2 Commercial 

• Grapheme 

• MATLAB 

• Mathematica 

• SAS 

• Stata 

43.9 See also 

• Canopy clustering algorithm 

• Centroidal Voronoi tessellation 

• k q-flats 

• Linde-Buzo-Gray algorithm 

• Nearest centroid classifier 

• Self-organizing map 

• Silhouette clustering 

• Head/tail Breaks

Chapter 44

Hierarchical clustering 


In data mining and statistics, hierarchical clustering
(also called hierarchical cluster analysis or HCA) is
a method of cluster analysis which seeks to build a
hierarchy of clusters. Strategies for hierarchical
clustering generally fall into two types: [1]

• Agglomerative: This is a “bottom up” approach: 
each observation starts in its own cluster, and pairs 
of clusters are merged as one moves up the hierar¬ 
chy. 

• Divisive: This is a “top down” approach: all obser¬ 
vations start in one cluster, and splits are performed 
recursively as one moves down the hierarchy. 

In general, the merges and splits are determined in a 
greedy manner. The results of hierarchical clustering are 
usually presented in a dendrogram. 

In the general case, the complexity of agglomerative
clustering is $O(n^3)$, which makes it too slow for large
data sets. Divisive clustering with an exhaustive search is
$O(2^n)$, which is even worse. However, for some special
cases, optimal efficient agglomerative methods (of
complexity $O(n^2)$) are known: SLINK [2] for single-linkage
and CLINK [3] for complete-linkage clustering.

44.1 Cluster dissimilarity 

In order to decide which clusters should be combined (for 
agglomerative), or where a cluster should be split (for di¬ 
visive), a measure of dissimilarity between sets of obser¬ 
vations is required. In most methods of hierarchical clus¬ 
tering, this is achieved by use of an appropriate metric (a 
measure of distance between pairs of observations), and a 
linkage criterion which specifies the dissimilarity of sets 
as a function of the pairwise distances of observations in 
the sets. 

44.1.1 Metric 


The choice of an appropriate metric will influence the 
shape of the clusters, as some elements may be close to 
one another according to one distance and farther away 
according to another. For example, in a 2-dimensional
space, the distance between the point (1,0) and the
origin (0,0) is always 1 according to the usual norms, but
the distance between the point (1,1) and the origin (0,0)
can be 2 under Manhattan distance, $\sqrt{2}$ under Euclidean
distance, or 1 under maximum distance.
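The three distances in this example can be checked with a few lines of code; the function names are illustrative.

```python
import math


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def maximum(a, b):
    """Largest absolute coordinate difference (Chebyshev distance)."""
    return max(abs(x - y) for x, y in zip(a, b))


origin = (0, 0)
for point in [(1, 0), (1, 1)]:
    print(point, manhattan(point, origin), euclidean(point, origin),
          maximum(point, origin))
```

For (1,0) all three functions return 1, while for (1,1) they return 2, the square root of 2, and 1 respectively, matching the text.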

Some commonly used metrics for hierarchical clustering
are: [4]

For text or other non-numeric data, metrics such as the 
Hamming distance or Levenshtein distance are often 
used. 

A review of cluster analysis in health psychology research 
found that the most common distance measure in pub¬ 
lished studies in that research area is the Euclidean dis¬ 
tance or the squared Euclidean distance. 

44.1.2 Linkage criteria 

The linkage criterion determines the distance between 
sets of observations as a function of the pairwise distances 
between observations. 

Some commonly used linkage criteria between two sets
of observations A and B are: [5][6]

where d is the chosen metric. Other linkage criteria in¬ 
clude: 

• The sum of all intra-cluster variance. 

• The decrease in variance for the cluster being
merged (Ward’s criterion). [7]

• The probability that candidate clusters spawn from
the same distribution function (V-linkage).

• The product of in-degree and out-degree on a
k-nearest-neighbor graph (graph degree linkage). [8]

• The increment of some cluster descriptor (i.e., a
quantity defined for measuring the quality of a
cluster) after merging two clusters. [9][10][11]

Further information: metric (mathematics)




44.2 Discussion 

Hierarchical clustering has the distinct advantage that any 
valid measure of distance can be used. In fact, the obser¬ 
vations themselves are not required: all that is used is a 
matrix of distances. 

44.3 Example for Agglomerative 
Clustering 

For example, suppose this data is to be clustered, and the 
Euclidean distance is the distance metric. 

Cutting the tree at a given height will give a partitioning 
clustering at a selected precision. In this example, cutting 
after the second row of the dendrogram will yield clusters 
{a} {b c} {d e} {f}. Cutting after the third row will yield
clusters {a} {b c} {d e f}, which is a coarser clustering,
with a smaller number of larger clusters.



Raw data 

The hierarchical clustering dendrogram would be as such: 

This method builds the hierarchy from the individual ele¬ 
ments by progressively merging clusters. In our example, 
we have six elements {a} {b} {c} {d} {e} and {f}. The 
first step is to determine which elements to merge in a 
cluster. Usually, we want to take the two closest elements, 
according to the chosen distance. 

Optionally, one can also construct a distance matrix at
this stage, where the number in the i-th row j-th column
is the distance between the i-th and j-th elements. Then,
as clustering progresses, rows and columns are merged as
the clusters are merged and the distances updated. This is
a common way to implement this type of clustering, and
has the benefit of caching distances between clusters. A
simple agglomerative clustering algorithm is described in
the single-linkage clustering page; it can easily be adapted
to different types of linkage (see below).

Suppose we have merged the two closest elements b and
c. We now have the following clusters {a}, {b, c}, {d},
{e} and {f}, and want to merge them further. To do that,
we need to take the distance between {a} and {b c}, and
therefore define the distance between two clusters.
Usually the distance between two clusters A and B is one of
the following:

• The maximum distance between elements of each
cluster (also called complete-linkage clustering):

$$\max \{ d(x, y) : x \in A, y \in B \}.$$

• The minimum distance between elements of each
cluster (also called single-linkage clustering):

$$\min \{ d(x, y) : x \in A, y \in B \}.$$

• The mean distance between elements of each cluster
(also called average linkage clustering, used e.g. in
UPGMA):

$$\frac{1}{|A| \cdot |B|} \sum_{x \in A} \sum_{y \in B} d(x, y).$$

• The sum of all intra-cluster variance.

• The increase in variance for the cluster being merged
(Ward’s method). [7]

• The probability that candidate clusters spawn from
the same distribution function (V-linkage).

Each agglomeration occurs at a greater distance between
clusters than the previous agglomeration, and one can
decide to stop clustering either when the clusters are too far
apart to be merged (distance criterion) or when there is a
sufficiently small number of clusters (number criterion).
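The merging procedure with the first three linkage choices can be sketched naively as below. This is an illustration under stated assumptions, not an efficient implementation: the function name, the one-dimensional coordinates chosen for the six elements a to f, and the use of the number criterion as the stopping rule are all made up for the example, and every cluster distance is recomputed from scratch on each pass.

```python
import math


def agglomerative(points, num_clusters, linkage="single"):
    """Merge the two closest clusters until `num_clusters` remain."""
    clusters = [[p] for p in points]  # every element starts in its own cluster

    def cluster_distance(a, b):
        pairwise = [math.dist(x, y) for x in a for y in b]
        if linkage == "single":      # minimum distance between elements
            return min(pairwise)
        if linkage == "complete":    # maximum distance between elements
            return max(pairwise)
        if linkage == "average":     # mean distance between elements
            return sum(pairwise) / len(pairwise)
        raise ValueError("unknown linkage: " + linkage)

    while len(clusters) > num_clusters:  # number criterion
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical coordinates for the elements a, b, c, d, e, f.
a, b, c, d, e, f = (0.0,), (1.0,), (1.5,), (5.0,), (5.2,), (9.0,)
result = agglomerative([a, b, c, d, e, f], num_clusters=3, linkage="single")
print(result)  # [[a, b, c], [d, e], [f]] as coordinate tuples
```

With these coordinates, d and e merge first, then b and c, then a joins {b, c}; switching `linkage` to "complete" or "average" changes the distances used but, for this data, not the final three clusters.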


44.4 Software 

44.4.1 Open Source Frameworks 

• R has several functions for hierarchical clustering: 
see CRAN Task View: Cluster Analysis & Finite 
Mixture Models for more information. 

• Cluster 3.0 provides a graphical user interface
to access different clustering routines and is
available for Windows, Mac OS X, Linux, Unix.

• ELKI includes multiple hierarchical clustering
algorithms, various linkage strategies and also includes
the efficient SLINK [2] algorithm, flexible cluster
extraction from dendrograms and various other cluster
analysis algorithms.

• Octave, the GNU analog to MATLAB, implements
hierarchical clustering in the linkage function.

• Orange, a free data mining software suite, module 
orngClustering for scripting in Python, or cluster 
analysis through visual programming. 

• scikit-learn implements hierarchical clustering.

• Weka includes hierarchical cluster analysis. 

• fastCluster efficiently implements the seven most 
widely used clustering schemes. 

• SCaViS is a computing environment in Java that
implements this algorithm.


44.4.2 Standalone implementations 

• CrimeStat implements two hierarchical clustering
routines, a nearest neighbor (Nnh) and a risk-adjusted
(Rnnh) routine.

• figue is a JavaScript package that implements some 
agglomerative clustering functions (single-linkage, 
complete-linkage, average-linkage) and functions to 
visualize clustering output (e.g. dendrograms). 

• hcluster is a Python implementation, based on 
NumPy, which supports hierarchical clustering and 
plotting. 

• Hierarchical Agglomerative Clustering is
implemented as a C# Visual Studio project that includes real
text file processing, building of a document-term
matrix with stop-word filtering, and stemming.

• MultiDendrograms is an open source Java application
for variable-group agglomerative hierarchical
clustering, with a graphical user interface.

• Graph Agglomerative Clustering (GAC) toolbox
implements several graph-based agglomerative
clustering algorithms.

• Hierarchical Clustering Explorer provides tools for 
interactive exploration of multidimensional data. 


44.4.3 Commercial 

• MATLAB includes hierarchical cluster analysis. 

• SAS includes hierarchical cluster analysis. 

• Mathematica includes a Hierarchical Clustering 
Package. 

• NCSS (statistical software) includes hierarchical 
cluster analysis. 

• SPSS includes hierarchical cluster analysis. 

• Qlucore Omics Explorer includes hierarchical clus¬ 
ter analysis. 

• Stata includes hierarchical cluster analysis. 

44.5 See also 

• Statistical distance 

• Brown clustering 

• Cluster analysis 

• CURE data clustering algorithm 

• Dendrogram 

• Determining the number of clusters in a data set 

• Hierarchical clustering of networks 

• Nearest-neighbor chain algorithm 

• Numerical taxonomy 

• OPTICS algorithm 

• Nearest neighbor search 

• Locality-sensitive hashing

Instance-based learning


In machine learning, instance-based learning (sometimes
called memory-based learning [1]) is a family of
learning algorithms that, instead of performing explicit
generalization, compares new problem instances with
instances seen in training, which have been stored in
memory. Instance-based learning is a kind of lazy learning.

It is called instance-based because it constructs
hypotheses directly from the training instances themselves. [2] This
means that the hypothesis complexity can grow with the
data: [2] in the worst case, a hypothesis is a list of n training
items and the computational complexity of classifying a
single new instance is $O(n)$. One advantage that
instance-based learning has over other methods of machine
learning is its ability to adapt its model to previously unseen
data: instance-based learners may simply store a new
instance or throw an old instance away.

Examples of instance-based learning algorithms are the
k-nearest neighbor algorithm, kernel machines and RBF
networks. [3]:ch. 8 These store (a subset of) their training
set; when predicting a value/class for a new instance, they
compute distances or similarities between this instance
and the training instances to make a decision.

To battle the memory complexity of storing all training
instances, as well as the risk of overfitting to noise in
the training set, instance reduction algorithms have been
proposed. [4]

Gagliardi [5] applies this family of classifiers in the
medical field as second-opinion diagnostic tools and as tools
for the knowledge extraction phase in the process of
knowledge discovery in databases. One of these
classifiers (called the Prototype exemplar learning classifier,
PEL-C) is able to extract a mixture of abstracted prototypical
cases (that are syndromes) and selected atypical clinical
cases.


45.1 See also


• Analogical modeling


45.2 References


k-nearest neighbors algorithm


In pattern recognition, the k-Nearest Neighbors
algorithm (or k-NN for short) is a non-parametric method
used for classification and regression. [1] In both cases, the
input consists of the k closest training examples in the
feature space. The output depends on whether k-NN is
used for classification or regression:

• In k-NN classification, the output is a 
class membership. An object is clas¬ 
sified by a majority vote of its neigh¬ 
bors, with the object being assigned to 
the class most common among its k near¬ 
est neighbors (k is a positive integer, typ¬ 
ically small). If k = 1, then the object is 
simply assigned to the class of that single 
nearest neighbor. 

• In k-NN regression, the output is the prop¬ 
erty value for the object. This value is 
the average of the values of its k nearest 
neighbors. 

k-NN is a type of instance-based learning, or lazy
learning, where the function is only approximated locally and
all computation is deferred until classification. The k-NN
algorithm is among the simplest of all machine learning
algorithms.

Both for classification and regression, it can be useful to
assign weight to the contributions of the neighbors, so that
the nearer neighbors contribute more to the average than
the more distant ones. For example, a common weighting
scheme consists in giving each neighbor a weight of 1/d,
where d is the distance to the neighbor. [2]

The neighbors are taken from a set of objects for which
the class (for k-NN classification) or the object
property value (for k-NN regression) is known. This can be
thought of as the training set for the algorithm, though no
explicit training step is required.

A shortcoming of the k-NN algorithm is that it is
sensitive to the local structure of the data. The algorithm has
nothing to do with and is not to be confused with k-means,
another popular machine learning technique.


46.1 Algorithm 




Example of k-NN classification. The test sample (green circle)
should be classified either to the first class of blue squares or to 
the second class of red triangles. If k = 3 (solid line circle) it 
is assigned to the second class because there are 2 triangles and 
only 1 square inside the inner circle. If k = 5 (dashed line circle) 
it is assigned to the first class (3 squares vs. 2 triangles inside the 
outer circle). 

The training examples are vectors in a multidimensional
feature space, each with a class label. The training phase
of the algorithm consists only of storing the feature vectors
and class labels of the training samples.

In the classification phase, k is a user-defined constant,
and an unlabeled vector (a query or test point) is classified
by assigning the label which is most frequent among the
k training samples nearest to that query point.
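
The storing and voting phases described above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and NumPy; the function name `knn_classify` and the toy data are hypothetical and not part of the original text.

```python
from collections import Counter

import numpy as np


def knn_classify(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    train_X = np.asarray(train_X, dtype=float)
    # Euclidean distance from the query to every stored training vector.
    dists = np.linalg.norm(train_X - np.asarray(query, dtype=float), axis=1)
    # Indices of the k smallest distances (stable sort keeps input order on ties).
    nearest = np.argsort(dists, kind="stable")[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# "Training" is nothing more than keeping the labeled vectors around.
train_X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]]
train_y = ["square", "square", "triangle", "triangle", "triangle"]

for k in (1, 3, 5):
    print(k, knn_classify(train_X, train_y, [0.1, 0.1], k=k))
```

With this toy data the predicted label changes between k = 3 and k = 5, mirroring the situation shown in the figure caption above.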

A commonly used distance metric for continuous variables
is Euclidean distance. For discrete variables, such
as for text classification, another metric can be used, such
as the overlap metric (or Hamming distance). In the
context of gene expression microarray data, for example,
k-NN has also been employed with correlation coefficients
such as Pearson and Spearman. [3] Often, the
classification accuracy of k-NN can be improved significantly
if the distance metric is learned with specialized
algorithms such as Large Margin Nearest Neighbor or
Neighbourhood components analysis.

A drawback of the basic “majority voting” classification
occurs when the class distribution is skewed. That is,
examples of a more frequent class tend to dominate the
prediction of the new example, because they tend to be
common among the k nearest neighbors due to their large
number. [4] One way to overcome this problem is to weight
the classification, taking into account the distance from
the test point to each of its k nearest neighbors. The class
(or value, in regression problems) of each of the k nearest
points is multiplied by a weight proportional to the inverse
of the distance from that point to the test point. Another
way to overcome skew is by abstraction in data representation.
For example, in a self-organizing map (SOM),
each node is a representative (a center) of a cluster of
similar points, regardless of their density in the original
training data. k-NN can then be applied to the SOM.
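
The inverse-distance weighting described above can be sketched as follows. This is an illustrative sketch only: the function name `weighted_knn_classify`, the small constant `eps` (added here to avoid division by zero when the query coincides with a training point) and the toy data are assumptions, not part of the original text.

```python
from collections import defaultdict

import numpy as np


def weighted_knn_classify(train_X, train_y, query, k=3, eps=1e-12):
    """Vote among the k nearest neighbors, each weighted by 1 / distance."""
    train_X = np.asarray(train_X, dtype=float)
    dists = np.linalg.norm(train_X - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(dists, kind="stable")[:k]
    scores = defaultdict(float)
    for i in nearest:
        # eps is an assumption of this sketch; it only guards against d == 0.
        scores[train_y[i]] += 1.0 / (dists[i] + eps)
    return max(scores, key=scores.get)


# Skewed toy data: one "rare" point very close to the query,
# four "common" points further away.
train_X = [[0.1, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
train_y = ["rare", "common", "common", "common", "common"]
print(weighted_knn_classify(train_X, train_y, [0.0, 0.0], k=5))
```

Here an unweighted majority vote over all five neighbors would favor the frequent class, while the weighted vote lets the single close neighbor (weight about 10) outweigh the four distant ones (weight 1 each).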


46.2 Parameter selection 

The best choice of k depends upon the data; generally,
larger values of k reduce the effect of noise on the
classification, [5] but make boundaries between classes less
distinct. A good k can be selected by various heuristic
techniques (see hyperparameter optimization). The special
case where the class is predicted to be the class of
the closest training sample (i.e. when k = 1) is called the
nearest neighbor algorithm.

The accuracy of the k-NN algorithm can be severely degraded
by the presence of noisy or irrelevant features, or
if the feature scales are not consistent with their importance.
Much research effort has been put into selecting or
scaling features to improve classification. A particularly
popular approach is the use of evolutionary algorithms to
optimize feature scaling. [6] Another popular approach is
to scale features by the mutual information of the training
data with the training classes.

In binary (two class) classification problems, it is helpful
to choose k to be an odd number as this avoids tied votes.
One popular way of choosing the empirically optimal k in
this setting is via the bootstrap method. [7]


46.3 Properties 

k-NN is a special case of a variable-bandwidth, kernel
density “balloon” estimator with a uniform kernel. [8][9]

The naive version of the algorithm is easy to implement
by computing the distances from the test example to all
stored examples, but it is computationally intensive for
large training sets. Using an appropriate nearest neighbor
search algorithm makes k-NN computationally tractable
even for large data sets. Many nearest neighbor search
algorithms have been proposed over the years; these generally
seek to reduce the number of distance evaluations
actually performed.

k-NN has some strong consistency results. As the amount
of data approaches infinity, the algorithm is guaranteed to
yield an error rate no worse than twice the Bayes error rate
(the minimum achievable error rate given the distribution
of the data). [10] k-NN is guaranteed to approach the Bayes
error rate for some value of k (where k increases as a
function of the number of data points). Various improvements
to k-NN are possible by using proximity graphs. [11]

46.4 Metric Learning 

The k-nearest neighbor classification performance can
often be significantly improved through (supervised)
metric learning. Popular algorithms are Neighbourhood
components analysis and Large margin nearest neighbor.
Supervised metric learning algorithms use the label information
to learn a new metric or pseudo-metric.

46.5 Feature extraction 

When the input data to an algorithm is too large to be
processed and it is suspected to be highly redundant
(e.g. the same measurement in both feet and meters)
then the input data will be transformed into a reduced
representation set of features (also named a feature vector).
Transforming the input data into the set of features
is called feature extraction. If the extracted features are
carefully chosen, it is expected that the feature set will
capture the relevant information from the input data, so that
the desired task can be performed using this reduced
representation instead of the full-size input. Feature extraction is
performed on raw data prior to applying the k-NN algorithm
to the transformed data in feature space.

An example of a typical computer vision computation
pipeline for face recognition using k-NN, including feature
extraction and dimension reduction pre-processing
steps (usually implemented with OpenCV):

1. Haar face detection

2. Mean-shift tracking analysis

3. PCA or Fisher LDA projection into feature space,
followed by k-NN classification

46.6 Dimension reduction 

For high-dimensional data (e.g., with number of dimensions
more than 10) dimension reduction is usually performed
prior to applying the k-NN algorithm in order to
avoid the effects of the curse of dimensionality. [12]

The curse of dimensionality in the k-NN context basically
means that Euclidean distance is unhelpful in high dimensions
because all vectors are almost equidistant to the
search query vector (imagine multiple points lying more
or less on a circle with the query point at the center; the
distance from the query to all data points in the search
space is almost the same).
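
The near-equidistance effect can be observed numerically. The sketch below is an illustration under stated assumptions: points are drawn uniformly from the unit hypercube, and the "relative contrast" (farthest minus nearest distance, divided by the nearest distance) is a hypothetical summary chosen for this example rather than a quantity defined in the text.

```python
import numpy as np


def distance_contrast(dim, n_points=1000, seed=0):
    """Relative gap between the farthest and nearest point from a random query."""
    rng = np.random.default_rng(seed)
    data = rng.random((n_points, dim))   # uniform points in the unit hypercube
    query = rng.random(dim)
    d = np.linalg.norm(data - query, axis=1)
    return (d.max() - d.min()) / d.min()


for dim in (2, 10, 100, 1000):
    print(dim, distance_contrast(dim))
```

As the dimension grows, the contrast shrinks toward zero, so the nearest and farthest neighbors become hard to tell apart by Euclidean distance alone.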

Feature extraction and dimension reduction can be combined
in one step using principal component analysis
(PCA), linear discriminant analysis (LDA), or
canonical correlation analysis (CCA) techniques as a
pre-processing step, followed by k-NN on
feature vectors in the reduced-dimension space. In machine
learning this process is also called low-dimensional
embedding. [13]

For very-high-dimensional datasets (e.g. when performing
a similarity search on live video streams, DNA
data or high-dimensional time series) running a fast
approximate k-NN search using locality sensitive hashing,
“random projections”, [14] “sketches” [15] or other
high-dimensional similarity search techniques from the VLDB
toolbox might be the only feasible option.


46.7 Decision boundary 

Nearest neighbor rules in effect implicitly compute the
decision boundary. It is also possible to compute the decision
boundary explicitly, and to do so efficiently, so that
the computational complexity is a function of the boundary
complexity. [16]


46.8 Data reduction 

Data reduction is one of the most important problems for 
work with huge data sets. Usually, only some of the data 
points are needed for accurate classification. Those data 
are called the prototypes and can be found as follows: 


1. Select the class-outliers, that is, training data that are
classified incorrectly by k-NN (for a given k)

2. Separate the rest of the data into two sets: (i) the
prototypes that are used for the classification decisions
and (ii) the absorbed points that can be correctly
classified by k-NN using prototypes. The absorbed
points can then be removed from the training
set.


46.8.1 Selection of class-outliers

A training example surrounded by examples of other
classes is called a class outlier. Causes of class outliers
include:

• random error

• insufficient training examples of this class (an isolated
example appears instead of a cluster)

• missing important features (the classes are separated
in other dimensions which we do not know)

• too many training examples of other classes (unbalanced
classes) that create a “hostile” background for
the given small class

Class outliers with k-NN produce noise. They can be
detected and separated for future analysis. Given two
natural numbers, k > r > 0, a training example is called a
(k,r)NN class-outlier if its k nearest neighbors include
more than r examples of other classes.


46.8.2 CNN for data reduction

Condensed nearest neighbor (CNN, the Hart algorithm)
is an algorithm designed to reduce the data set for k-NN
classification. [17] It selects the set of prototypes U
from the training data, such that 1NN with U can classify
the examples almost as accurately as 1NN does with
the whole data set.

Calculation of the border ratio.

Three types of points: prototypes (■), class-outliers (x), and
absorbed points (o).

Given a training set X, CNN works iteratively:

1. Scan all elements of X, looking for an element x
whose nearest prototype from U has a different label
than x.

2. Remove x from X and add it to U

3. Repeat the scan until no more prototypes are added
to U.

Use U instead of X for classification. The examples that 
are not prototypes are called “absorbed” points. 
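
The two data-reduction steps (detecting class-outliers, then condensing the remainder into prototypes) can be sketched as below. This is an illustration, not Hart's original formulation: the function names are hypothetical, Euclidean distance is assumed, and because the text does not say how U is initialised, the sketch seeds U with the first training example.

```python
import numpy as np


def class_outliers(X, y, k, r):
    """Indices of (k,r)NN class-outliers: more than r of the k nearest
    neighbors (excluding the point itself) carry a different label."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    outliers = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # a point is not its own neighbor
        nearest = np.argsort(d, kind="stable")[:k]
        if np.sum(y[nearest] != y[i]) > r:
            outliers.append(i)
    return outliers


def condensed_nn(X, y):
    """Return sorted indices of the prototype set U found by repeated scans."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    prototypes = [0]                       # assumption: seed U with example 0
    changed = True
    while changed:                         # repeat scans until U stops growing
        changed = False
        for i in range(len(X)):
            if i in prototypes:
                continue
            d = np.linalg.norm(X[prototypes] - X[i], axis=1)
            nearest = prototypes[int(np.argmin(d))]
            if y[nearest] != y[i]:         # misclassified by 1NN on U: absorb into U
                prototypes.append(i)
                changed = True
    return sorted(prototypes)


X = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
y = [0, 0, 0, 1, 1, 1]
print(condensed_nn(X, y))                       # one prototype per cluster
print(class_outliers(X + [[11.4]], y + [0], k=3, r=2))
```

In the second call, a single class-0 point placed inside the class-1 cluster is reported as a (3,2)NN class-outlier, since all three of its nearest neighbors belong to the other class.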

It is efficient to scan the training examples in order of
decreasing border ratio. [18] The border ratio of a training
example x is defined as

$$a(x) = \frac{\|x' - y\|}{\|x - y\|}$$

where $\|x - y\|$ is the distance to the closest example y having
a different label than x, and $\|x' - y\|$ is the distance from y
to its closest example x' with the same label as x.

The border ratio is in the interval [0,1] because $\|x' - y\|$
never exceeds $\|x - y\|$ (x itself is one of the candidates for x').
This ordering gives preference to
the borders of the classes for inclusion in the set of prototypes
U. A point with a different label than x is called external
to x. The calculation of the border ratio is illustrated
by the figure “Calculation of the border ratio”. The data points are labeled by
colors: the initial point is x and its label is red. External
points are blue and green. The closest external point to x
is y. The closest red point to y is x'. The border ratio
$a(x) = \|x' - y\| / \|x - y\|$ is the attribute of the initial point x.
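
The definition of the border ratio translates directly into code. The following sketch assumes Euclidean distance; the function name `border_ratio` and the three-point example are hypothetical.

```python
import numpy as np


def border_ratio(X, labels, i):
    """Border ratio a(x) = ||x' - y|| / ||x - y|| for the example at index i."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    x = X[i]
    external = np.flatnonzero(labels != labels[i])
    # y: the closest example whose label differs from that of x.
    j = external[np.argmin(np.linalg.norm(X[external] - x, axis=1))]
    same = np.flatnonzero(labels == labels[i])
    # x': the example closest to y among those sharing x's label (may be x itself).
    dist_x_prime_y = np.linalg.norm(X[same] - X[j], axis=1).min()
    return dist_x_prime_y / np.linalg.norm(x - X[j])


# Two red points and one blue point on a line.
X = [[0.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
labels = ["red", "red", "blue"]
print(border_ratio(X, labels, 0))   # interior red point: ratio below 1
print(border_ratio(X, labels, 1))   # red point on the class border: ratio 1
```

Points on the class border receive ratios close to 1, so scanning in order of decreasing border ratio visits them first.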

Below is an illustration of CNN in a series of figures.
There are three classes (red, green and blue). Fig. 1:
initially there are 60 points in each class. Fig. 2 shows
the 1NN classification map: each pixel is classified by
1NN using all the data. Fig. 3 shows the 5NN classification
map. White areas correspond to the unclassified
regions, where 5NN voting is tied (for example, if there
are two green, two red and one blue points among the 5 nearest
neighbors). Fig. 4 shows the reduced data set. The
crosses are the class-outliers selected by the (3,2)NN rule
(all three nearest neighbors of these instances belong
to other classes); the squares are the prototypes, and the
empty circles are the absorbed points. The bottom left
corner shows the numbers of the class-outliers, prototypes
and absorbed points for all three classes. The number
of prototypes varies from 15% to 20% for different
classes in this example. Fig. 5 shows that the 1NN classification
map with the prototypes is very similar to that
with the initial data set. The figures were produced using
the Mirkes applet. [18]

• CNN model reduction for k-NN classifiers 

• Fig. 1. The dataset. 

• Fig. 2. The 1NN classification map.

• Fig. 3. The 5NN classification map.

• Fig. 4. The CNN reduced dataset.

• Fig. 5. The 1NN classification map based on the
CNN extracted prototypes.

46.9 k-NN regression

In k-NN regression, the k-NN algorithm is used for estimating
continuous variables. One such algorithm uses
a weighted average of the k nearest neighbors, weighted
by the inverse of their distance. This algorithm works as
follows:

1. Compute the Euclidean or Mahalanobis distance 
from the query example to the labeled examples. 

2. Order the labeled examples by increasing distance. 

3. Find a heuristically optimal number k of nearest 
neighbors, based on RMSE. This is done using cross 
validation. 

4. Calculate an inverse distance weighted average with
the k-nearest multivariate neighbors.
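
Steps 1, 2 and 4 above can be sketched as follows for a fixed k (the cross-validated choice of k in step 3 is left out). Euclidean distance is assumed; the function name `knn_regress` and the constant `eps`, which only guards against division by zero, are assumptions of this sketch.

```python
import numpy as np


def knn_regress(train_X, train_y, query, k=3, eps=1e-12):
    """Inverse-distance weighted average of the k nearest target values."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y, dtype=float)
    # Step 1: distance from the query to every labeled example.
    d = np.linalg.norm(train_X - np.asarray(query, dtype=float), axis=1)
    # Step 2: order by increasing distance and keep the k nearest.
    nearest = np.argsort(d, kind="stable")[:k]
    # Step 4: weights proportional to 1 / distance.
    w = 1.0 / (d[nearest] + eps)
    return float(np.sum(w * train_y[nearest]) / np.sum(w))


train_X = [[0.0], [1.0], [3.0]]
train_y = [0.0, 10.0, 30.0]
print(knn_regress(train_X, train_y, [2.0], k=2))
```

The query at 2.0 is equally far from the examples at 1.0 and 3.0, so with k = 2 the estimate is the plain average of their target values.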

46.10 Validation of results 

A confusion matrix or “matching matrix” is often used
as a tool to validate the accuracy of k-NN classification.
More robust statistical methods such as the likelihood-ratio
test can also be applied.
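
A confusion matrix simply counts, for each true class, how often each class was predicted. The sketch below is a minimal illustration with hypothetical labels and a hypothetical helper name; rows are indexed by the actual class and columns by the predicted class.

```python
import numpy as np


def confusion_matrix(actual, predicted, labels):
    """Entry [i, j] counts examples of class labels[i] predicted as labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    m = np.zeros((len(labels), len(labels)), dtype=int)
    for a, p in zip(actual, predicted):
        m[index[a], index[p]] += 1
    return m


actual = ["a", "a", "b", "b", "b"]
predicted = ["a", "b", "b", "b", "a"]
m = confusion_matrix(actual, predicted, labels=["a", "b"])
accuracy = np.trace(m) / m.sum()   # correct predictions lie on the diagonal
print(m)
print(accuracy)
```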

46.11 See also 

• Instance-based learning 

• Nearest neighbor search 

• Statistical classification 

• Cluster analysis 

• Data mining 

• Nearest centroid classifier 

• Pattern recognition 

• Curse of dimensionality 

• Dimension reduction 

• Principal Component Analysis 

• Locality Sensitive Hashing

• MinHash 

• Cluster hypothesis 

• Closest pair of points problem


Chapter 47

Principal component analysis 



PCA of a multivariate Gaussian distribution centered at (1,3)
with a standard deviation of 3 in roughly the (0.878, 0.478)
direction and of 1 in the orthogonal direction. The vectors shown
are the eigenvectors of the covariance matrix scaled by the square
root of the corresponding eigenvalue, and shifted so their tails are
at the mean.

Principal component analysis (PCA) is a statistical
procedure that uses an orthogonal transformation to convert
a set of observations of possibly correlated variables
into a set of values of linearly uncorrelated variables
called principal components. The number of principal
components is less than or equal to the number of original
variables. This transformation is defined in such a way
that the first principal component has the largest possible
variance (that is, accounts for as much of the variability
in the data as possible), and each succeeding component
in turn has the highest variance possible under the constraint
that it is orthogonal to the preceding components.
The resulting vectors are an uncorrelated orthogonal basis
set. The principal components are orthogonal because
they are the eigenvectors of the covariance matrix, which
is symmetric. PCA is sensitive to the relative scaling of
the original variables.

Depending on the field of application, it is also named
the discrete Karhunen–Loève transform (KLT) in signal
processing, the Hotelling transform in multivariate
quality control, proper orthogonal decomposition (POD)
in mechanical engineering, singular value decomposition
(SVD) of $\mathbf{X}$ (Golub and Van Loan, 1983), eigenvalue
decomposition (EVD) of $\mathbf{X}^T\mathbf{X}$ in linear algebra, factor
analysis (for a discussion of the differences between PCA
and factor analysis see Ch. 7 of [1]), Eckart–Young
theorem (Harman, 1960), or Schmidt–Mirsky theorem in
psychometrics, empirical orthogonal functions (EOF) in
meteorological science, empirical eigenfunction decomposition
(Sirovich, 1987), empirical component analysis
(Lorenz, 1956), quasiharmonic modes (Brooks et al.,
1988), spectral decomposition in noise and vibration, and
empirical modal analysis in structural dynamics.

PCA was invented in 1901 by Karl Pearson, [2] as an analogue
of the principal axis theorem in mechanics; it was
later independently developed (and named) by Harold
Hotelling in the 1930s. [3] The method is mostly used as a
tool in exploratory data analysis and for making predictive
models. PCA can be done by eigenvalue decomposition
of a data covariance (or correlation) matrix or singular
value decomposition of a data matrix, usually after mean
centering (and normalizing or using Z-scores) the data
matrix for each attribute. [4] The results of a PCA are
usually discussed in terms of component scores, sometimes
called factor scores (the transformed variable values
corresponding to a particular data point), and loadings
(the weight by which each standardized original variable
should be multiplied to get the component score). [5]

PCA is the simplest of the true eigenvector-based multivariate
analyses. Often, its operation can be thought of
as revealing the internal structure of the data in a way
that best explains the variance in the data. If a multivariate
dataset is visualised as a set of coordinates in a
high-dimensional data space (1 axis per variable), PCA can
supply the user with a lower-dimensional picture, a projection
or “shadow” of this object when viewed from its
(in some sense; see below) most informative viewpoint.
This is done by using only the first few principal components
so that the dimensionality of the transformed data
is reduced.

PCA is closely related to factor analysis. Factor analysis
typically incorporates more domain specific assumptions
about the underlying structure and solves eigenvectors of
a slightly different matrix.

PCA is also related to canonical correlation analysis
(CCA). CCA defines coordinate systems that optimally
describe the cross-covariance between two datasets while
PCA defines a new orthogonal coordinate system that optimally
describes variance in a single dataset. [6][7]


47.1 Intuition 

PCA can be thought of as fitting an n-dimensional
ellipsoid to the data, where each axis of the ellipsoid represents
a principal component. If some axis of the ellipsoid
is small, then the variance along that axis is also small,
and by omitting that axis and its corresponding principal
component from our representation of the dataset, we
lose only a commensurately small amount of information.

To find the axes of the ellipsoid, we must first subtract the
mean of each variable from the dataset to center the data
around the origin. Then, we compute the covariance matrix
of the data, and calculate the eigenvalues and corresponding
eigenvectors of this covariance matrix. Then,
we must orthogonalize the set of eigenvectors, and normalize
each to become unit vectors. Once this is done,
each of the mutually orthogonal, unit eigenvectors can be
interpreted as an axis of the ellipsoid fitted to the data.
The proportion of the variance that each eigenvector represents
can be calculated by dividing the eigenvalue corresponding
to that eigenvector by the sum of all eigenvalues.
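
The procedure in the paragraph above can be sketched with NumPy. This is one possible illustration, not a reference implementation: the function name `pca_eig` and the toy data are hypothetical, and `numpy.linalg.eigh` is used because it already returns orthonormal eigenvectors for a symmetric matrix, which covers the orthogonalize-and-normalize step.

```python
import numpy as np


def pca_eig(data):
    """PCA by eigendecomposition of the sample covariance matrix."""
    X = np.asarray(data, dtype=float)
    Xc = X - X.mean(axis=0)                   # center each variable
    cov = np.cov(Xc, rowvar=False)            # covariance matrix (variables in columns)
    eigvals, eigvecs = np.linalg.eigh(cov)    # ascending eigenvalues, orthonormal vectors
    order = np.argsort(eigvals)[::-1]         # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()       # proportion of variance per axis
    return eigvals, eigvecs, explained


# Nearly collinear toy data along the direction (1, 2).
data = [[0.0, 0.1], [1.0, 1.9], [2.0, 4.1], [3.0, 5.9]]
eigvals, eigvecs, explained = pca_eig(data)
print(eigvecs[:, 0])   # first axis of the fitted ellipsoid (sign is arbitrary)
print(explained)
```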

It is important to note that this procedure is sensitive to 
the scaling of the data, and that there is no consensus as 
to how to best scale the data to obtain optimal results. 


47.2 Details 

PCA is mathematically defined [1] as an orthogonal
linear transformation that transforms the data to a new
coordinate system such that the greatest variance by some
projection of the data comes to lie on the first coordinate
(called the first principal component), the second greatest
variance on the second coordinate, and so on.

Consider a data matrix, $\mathbf{X}$, with column-wise zero
empirical mean (the sample mean of each column has
been shifted to zero), where each of the n rows represents
a different repetition of the experiment, and each
of the p columns gives a particular kind of datum (say,
the results from a particular sensor).

Mathematically, the transformation is defined by a set
of p-dimensional vectors of weights or loadings
$\mathbf{w}_{(k)} = (w_1, \dots, w_p)_{(k)}$ that map each row vector
$\mathbf{x}_{(i)}$ of $\mathbf{X}$ to a new vector of principal component scores
$\mathbf{t}_{(i)} = (t_1, \dots, t_p)_{(i)}$, given by

$$t_{k(i)} = \mathbf{x}_{(i)} \cdot \mathbf{w}_{(k)}$$

in such a way that the individual variables of $\mathbf{t}$ considered
over the data set successively inherit the maximum
possible variance from $\mathbf{x}$, with each loading vector $\mathbf{w}$
constrained to be a unit vector.


47.2.1 First component 

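The first loading vector $\mathbf{w}_{(1)}$ thus has to satisfy

$$\mathbf{w}_{(1)} = \underset{\|\mathbf{w}\|=1}{\arg\max} \left\{ \sum_i \left(t_1\right)_{(i)}^2 \right\} = \underset{\|\mathbf{w}\|=1}{\arg\max} \left\{ \sum_i \left(\mathbf{x}_{(i)} \cdot \mathbf{w}\right)^2 \right\}$$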


Equivalently, writing this in matrix form gives 


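$$\mathbf{w}_{(1)} = \underset{\|\mathbf{w}\|=1}{\arg\max} \left\{ \|\mathbf{X}\mathbf{w}\|^2 \right\} = \underset{\|\mathbf{w}\|=1}{\arg\max} \left\{ \mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w} \right\}$$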

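Since $\mathbf{w}_{(1)}$ has been defined to be a unit vector, it equivalently
also satisfies

$$\mathbf{w}_{(1)} = \arg\max \left\{ \frac{\mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w}}{\mathbf{w}^T \mathbf{w}} \right\}$$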


The quantity to be maximised can be recognised as a
Rayleigh quotient. A standard result for a symmetric matrix
such as $\mathbf{X}^T\mathbf{X}$ is that the quotient’s maximum possible
value is the largest eigenvalue of the matrix, which occurs
when $\mathbf{w}$ is the corresponding eigenvector.

With $\mathbf{w}_{(1)}$ found, the first component of a data vector $\mathbf{x}_{(i)}$
can then be given as a score $t_{1(i)} = \mathbf{x}_{(i)} \cdot \mathbf{w}_{(1)}$ in the
transformed co-ordinates, or as the corresponding vector in
the original variables, $\{\mathbf{x}_{(i)} \cdot \mathbf{w}_{(1)}\}\, \mathbf{w}_{(1)}$.
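
The Rayleigh-quotient result can be checked numerically. The sketch below uses randomly generated data (an assumption of this example) and compares the quotient at the leading eigenvector of $\mathbf{X}^T\mathbf{X}$ with its value at many random directions; the helper name `rayleigh` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with very different spread along each column, then centered.
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.3])
X -= X.mean(axis=0)


def rayleigh(w, X):
    """Rayleigh quotient w^T X^T X w / w^T w."""
    return (w @ X.T @ X @ w) / (w @ w)


eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
w1 = eigvecs[:, -1]                          # eigenvector of the largest eigenvalue
best_random = max(rayleigh(w, X) for w in rng.normal(size=(1000, 3)))

print(rayleigh(w1, X), eigvals[-1])   # quotient at w1 equals the largest eigenvalue
print(best_random)                    # no random direction exceeds it
```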

47.2.2 Further components 

The kth component can be found by subtracting the first
k − 1 principal components from $\mathbf{X}$:

$$\hat{\mathbf{X}}_k = \mathbf{X} - \sum_{s=1}^{k-1} \mathbf{X} \mathbf{w}_{(s)} \mathbf{w}_{(s)}^T$$

and then finding the loading vector which extracts the
maximum variance from this new data matrix

$$\mathbf{w}_{(k)} = \underset{\|\mathbf{w}\|=1}{\arg\max} \left\{ \|\hat{\mathbf{X}}_k \mathbf{w}\|^2 \right\} = \arg\max \left\{ \frac{\mathbf{w}^T \hat{\mathbf{X}}_k^T \hat{\mathbf{X}}_k \mathbf{w}}{\mathbf{w}^T \mathbf{w}} \right\}$$

It turns out that this gives the remaining eigenvectors of
$\mathbf{X}^T\mathbf{X}$, with the maximum values for the quantity in brackets
given by their corresponding eigenvalues.

The kth principal component of a data vector $\mathbf{x}_{(i)}$ can
therefore be given as a score $t_{k(i)} = \mathbf{x}_{(i)} \cdot \mathbf{w}_{(k)}$ in the
transformed co-ordinates, or as the corresponding vector in the
space of the original variables, $\{\mathbf{x}_{(i)} \cdot \mathbf{w}_{(k)}\}\, \mathbf{w}_{(k)}$, where
$\mathbf{w}_{(k)}$ is the kth eigenvector of $\mathbf{X}^T\mathbf{X}$.
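
The subtract-then-maximise procedure for further components can be sketched as below. This is an illustration under assumptions: the maximisation at each step is delegated to `numpy.linalg.eigh` applied to the deflated matrix, and the function names and synthetic data are hypothetical.

```python
import numpy as np


def top_eigvec(M):
    """Unit eigenvector of the symmetric matrix M with the largest eigenvalue."""
    _, vecs = np.linalg.eigh(M)
    return vecs[:, -1]


def pca_deflation(X, n_components):
    """Find loading vectors one at a time by removing earlier components from X."""
    X = np.asarray(X, dtype=float)
    loadings = []
    for _ in range(n_components):
        # X_hat_k = X - sum_s X w_s w_s^T over the loadings found so far.
        X_hat = X - sum(np.outer(X @ w, w) for w in loadings)
        loadings.append(top_eigvec(X_hat.T @ X_hat))
    return np.column_stack(loadings)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.3])
X -= X.mean(axis=0)

W = pca_deflation(X, 3)
_, V = np.linalg.eigh(X.T @ X)
V = V[:, ::-1]                      # eigenvectors, largest eigenvalue first
print(np.abs(W.T @ V).round(6))     # close to the identity: same vectors up to sign
```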

The full principal components decomposition of $\mathbf{X}$ can
therefore be given as

$$\mathbf{T} = \mathbf{X}\mathbf{W}$$

where $\mathbf{W}$ is a p-by-p matrix whose columns are the eigenvectors
of $\mathbf{X}^T\mathbf{X}$.


47.2.3 Covariances 

$\mathbf{X}^T\mathbf{X}$ itself can be recognised as proportional to the empirical
sample covariance matrix of the dataset $\mathbf{X}$.

The sample covariance Q between two of the different
principal components over the dataset is given by:

$$\begin{aligned}
Q(\mathrm{PC}_{(j)}, \mathrm{PC}_{(k)}) &\propto (\mathbf{X}\mathbf{w}_{(j)})^T \cdot (\mathbf{X}\mathbf{w}_{(k)}) \\
&= \mathbf{w}_{(j)}^T \mathbf{X}^T \mathbf{X} \mathbf{w}_{(k)} \\
&= \mathbf{w}_{(j)}^T \lambda_{(k)} \mathbf{w}_{(k)} \\
&= \lambda_{(k)} \mathbf{w}_{(j)}^T \mathbf{w}_{(k)}
\end{aligned}$$

where the eigenvalue property of $\mathbf{w}_{(k)}$ has been used to
move from line 2 to line 3. However, eigenvectors $\mathbf{w}_{(j)}$ and
$\mathbf{w}_{(k)}$ corresponding to eigenvalues of a symmetric matrix
are orthogonal (if the eigenvalues are different), or can be
orthogonalised (if the vectors happen to share an equal
repeated value). The product in the final line is therefore
zero; there is no sample covariance between different
principal components over the dataset.

Another way to characterise the principal components
transformation is therefore as the transformation to co-ordinates
which diagonalise the empirical sample covariance
matrix.

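In matrix form, the empirical covariance matrix for the
original variables can be written

$$\mathbf{Q} \propto \mathbf{X}^T\mathbf{X} = \mathbf{W}\boldsymbol{\Lambda}\mathbf{W}^T$$

The empirical covariance matrix between the principal
components becomes

$$\mathbf{W}^T\mathbf{Q}\mathbf{W} \propto \mathbf{W}^T\mathbf{W}\boldsymbol{\Lambda}\mathbf{W}^T\mathbf{W} = \boldsymbol{\Lambda}$$

where $\boldsymbol{\Lambda}$ is the diagonal matrix of eigenvalues $\lambda_{(k)}$ of $\mathbf{X}^T\mathbf{X}$
($\lambda_{(k)}$ being equal to the sum of the squares over the dataset
associated with each component k:
$\lambda_{(k)} = \sum_i t_{k(i)}^2 = \sum_i (\mathbf{x}_{(i)} \cdot \mathbf{w}_{(k)})^2$).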


47.2.4 Dimensionality reduction 

The full transformation $\mathbf{T} = \mathbf{X}\mathbf{W}$ maps a data vector
$\mathbf{x}_{(i)}$ from an original space of p variables to a new space of
p variables which are uncorrelated over the dataset. However,
not all the principal components need to be kept.
Keeping only the first L principal components, produced
by using only the first L loading vectors, gives the truncated
transformation

$$\mathbf{T}_L = \mathbf{X}\mathbf{W}_L$$

where the matrix $\mathbf{T}_L$ now has n rows but only L columns.
In other words, PCA learns a linear transformation
$t = W^T x,\ x \in R^p,\ t \in R^L$, where the columns of the p × L matrix
$W$ form an orthogonal basis for the L features (the
components of representation t) that are decorrelated. [8]
By construction, of all the transformed data matrices
with only L columns, this score matrix maximises the
variance in the original data that has been preserved,
while minimising the total squared reconstruction error
$\|\mathbf{T}\mathbf{W}^T - \mathbf{T}_L\mathbf{W}_L^T\|_2^2$ or $\|\mathbf{X} - \mathbf{X}_L\|_2^2$.

A principal components analysis scatterplot of Y-STR haplotypes
calculated from repeat-count values for 37 Y-chromosomal STR
markers from 354 individuals.

PCA has successfully found linear combinations of the different
markers that separate out different clusters corresponding to
different lines of individuals’ Y-chromosomal genetic descent.

Such dimensionality reduction can be a very useful step
for visualising and processing high-dimensional datasets,
while still retaining as much of the variance in the dataset
as possible. For example, selecting L = 2 and keeping
only the first two principal components finds the
two-dimensional plane through the high-dimensional dataset
in which the data is most spread out. If the data contains
clusters, these too may be most spread out in that plane, and
therefore most visible when plotted in a two-dimensional
diagram. If instead two directions through the data (or two
of the original variables) are chosen at random, the clusters
may be much less spread apart from each other, and
may in fact be much more likely to substantially overlay
each other, making them indistinguishable.
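
Keeping only the first L components, as in the truncated transformation $\mathbf{T}_L = \mathbf{X}\mathbf{W}_L$, can be sketched as follows. The function name `truncated_pca`, the synthetic data and the choice L = 2 are assumptions of this example; the reconstruction error is computed as the sum of squared entries of $\mathbf{X} - \mathbf{X}_L$.

```python
import numpy as np


def truncated_pca(X, L):
    """Scores T_L = X W_L, reconstruction X_L = T_L W_L^T, and all eigenvalues."""
    lam, W = np.linalg.eigh(X.T @ X)
    order = np.argsort(lam)[::-1]          # largest eigenvalue first
    lam, W = lam[order], W[:, order]
    W_L = W[:, :L]                         # first L loading vectors
    T_L = X @ W_L                          # n-by-L score matrix
    X_L = T_L @ W_L.T                      # rank-L approximation of X
    return T_L, X_L, lam


rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
X -= X.mean(axis=0)

T_L, X_L, lam = truncated_pca(X, L=2)
error = np.sum((X - X_L) ** 2)
print(T_L.shape)
print(error, lam[2:].sum())   # squared error equals the sum of discarded eigenvalues
```

The two-column matrix `T_L` is what would be plotted in a two-dimensional scatter diagram of the data.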

Similarly, in regression analysis, the larger the number of
explanatory variables allowed, the greater is the chance
of overfitting the model, producing conclusions that fail
to generalise to other datasets. One approach, especially
when there are strong correlations between different possible
explanatory variables, is to reduce them to a few
principal components and then run the regression against
them, a method called principal component regression.

Dimensionality reduction may also be appropriate when
the variables in a dataset are noisy. If each column of
the dataset contains independent identically distributed
Gaussian noise, then the columns of $\mathbf{T}$ will also contain
similarly identically distributed Gaussian noise (such a
distribution is invariant under the effects of the matrix
$\mathbf{W}$, which can be thought of as a high-dimensional rotation
of the co-ordinate axes). However, with more of the
total variance concentrated in the first few principal components
compared to the same noise variance, the proportionate
effect of the noise is less: the first few components
achieve a higher signal-to-noise ratio. PCA thus
can have the effect of concentrating much of the signal
into the first few principal components, which can usefully
be captured by dimensionality reduction, while the
later principal components may be dominated by noise,
and so disposed of without great loss.

47.2.5 Singular value decomposition 

The principal components transformation can also be associated
with another matrix factorisation, the singular
value decomposition (SVD) of $\mathbf{X}$,

$$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{W}^T$$

Here $\boldsymbol{\Sigma}$ is an n-by-p rectangular diagonal matrix of positive
numbers $\sigma_{(k)}$, called the singular values of $\mathbf{X}$; $\mathbf{U}$ is
an n-by-n matrix, the columns of which are orthogonal
unit vectors of length n called the left singular vectors of
$\mathbf{X}$; and $\mathbf{W}$ is a p-by-p matrix whose columns are orthogonal unit
vectors of length p, called the right singular vectors of
$\mathbf{X}$.

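In terms of this factorisation, the matrix $\mathbf{X}^T\mathbf{X}$ can be written

$$\begin{aligned}
\mathbf{X}^T\mathbf{X} &= \mathbf{W}\boldsymbol{\Sigma}\mathbf{U}^T\mathbf{U}\boldsymbol{\Sigma}\mathbf{W}^T \\
&= \mathbf{W}\boldsymbol{\Sigma}^2\mathbf{W}^T
\end{aligned}$$

Comparison with the eigenvector factorisation of $\mathbf{X}^T\mathbf{X}$
establishes that the right singular vectors $\mathbf{W}$ of $\mathbf{X}$ are
equivalent to the eigenvectors of $\mathbf{X}^T\mathbf{X}$, while the singular
values $\sigma_{(k)}$ of $\mathbf{X}$ are equal to the square roots of the
eigenvalues $\lambda_{(k)}$ of $\mathbf{X}^T\mathbf{X}$.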


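Using the singular value decomposition the score matrix
$\mathbf{T}$ can be written

$$\begin{aligned}
\mathbf{T} &= \mathbf{X}\mathbf{W} \\
&= \mathbf{U}\boldsymbol{\Sigma}\mathbf{W}^T\mathbf{W} \\
&= \mathbf{U}\boldsymbol{\Sigma}
\end{aligned}$$

so each column of $\mathbf{T}$ is given by one of the left singular
vectors of $\mathbf{X}$ multiplied by the corresponding singular
value. This form is also the polar decomposition of $\mathbf{T}$.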

Efficient algorithms exist to calculate the SVD of $\mathbf{X}$ without
having to form the matrix $\mathbf{X}^T\mathbf{X}$, so computing the
SVD is now the standard way to calculate a principal components
analysis from a data matrix, unless only a handful
of components are required.
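
The relationships between the SVD and the eigendecomposition can be checked with NumPy's `numpy.linalg.svd`. The synthetic data below is an assumption of this sketch; `full_matrices=False` requests the reduced SVD, in which U has only p columns.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3)) * np.array([2.0, 1.0, 0.5])
X -= X.mean(axis=0)                       # column-wise zero mean

# Reduced SVD: X = U @ diag(s) @ Wt, singular values in descending order.
U, s, Wt = np.linalg.svd(X, full_matrices=False)

T = U * s                                 # scores T = U Sigma, without forming X^T X
lam = np.linalg.eigvalsh(X.T @ X)[::-1]   # eigenvalues of X^T X, descending

print(np.allclose(T, X @ Wt.T))           # T = X W
print(np.allclose(s ** 2, lam))           # singular values squared are the eigenvalues
```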

As with the eigen-decomposition, a truncated n × L score
matrix $\mathbf{T}_L$ can be obtained by considering only the first L
largest singular values and their singular vectors:

$$\mathbf{T}_L = \mathbf{U}_L\boldsymbol{\Sigma}_L = \mathbf{X}\mathbf{W}_L$$

The truncation of a matrix M or T using a truncated singular
value decomposition in this way produces a truncated
matrix that is the nearest possible matrix of rank
L to the original matrix, in the sense of the difference
between the two having the smallest possible Frobenius
norm, a result known as the Eckart–Young theorem
[1936].


47.3 Further considerations 

Given a set of points in Euclidean space, the first principal
component corresponds to a line that passes through
the multidimensional mean and minimizes the sum of
squares of the distances of the points from the line. The
second principal component corresponds to the same
concept after all correlation with the first principal component
has been subtracted from the points. The singular
values (in $\boldsymbol{\Sigma}$) are the square roots of the eigenvalues of
the matrix $\mathbf{X}^T\mathbf{X}$. Each eigenvalue is proportional to the
portion of the “variance” (more correctly of the sum of
the squared distances of the points from their multidimensional
mean) that is correlated with each eigenvector.
The sum of all the eigenvalues is equal to the sum
of the squared distances of the points from their multidimensional
mean. PCA essentially rotates the set of points
around their mean in order to align with the principal
components. This moves as much of the variance as possible
(using an orthogonal transformation) into the first
few dimensions. The values in the remaining dimensions,
therefore, tend to be small and may be dropped with minimal
loss of information (see below). PCA is often used
in this manner for dimensionality reduction. PCA has the
distinction of being the optimal orthogonal transformation
for keeping the subspace that has largest “variance”
(as defined above). This advantage, however, comes at
the price of greater computational requirements when
compared, where applicable, to the discrete
cosine transform, and in particular to the DCT-II which
is simply known as the “DCT”. Nonlinear dimensionality
reduction techniques tend to be more computationally
demanding than PCA.

PCA is sensitive to the scaling of the variables. If we 
have just two variables and they have the same sample 
variance and are positively correlated, then the PCA will 
entail a rotation by 45° and the “loadings” for the two 
variables with respect to the principal component will be 
equal. But if we multiply all values of the first variable 
by 100, then the first principal component will be almost 
the same as that variable, with a small contribution from 
the other variable, whereas the second component will 
be almost aligned with the second original variable. This 
means that whenever the different variables have differ¬ 
ent units (like temperature and mass), PCA is a somewhat 
arbitrary method of analysis. (Different results would be 
obtained if one used Fahrenheit rather than Celsius for 
example.) Note that Pearson’s original paper was entitled 
“On Lines and Planes of Closest Fit to Systems of Points 
in Space” - “in space” implies physical Euclidean space 
where such concerns do not arise. One way of making 
the PCA less arbitrary is to use variables scaled so as to 
have unit variance, by standardizing the data and hence 
use the autocorrelation matrix instead of the autocovari¬ 
ance matrix as a basis for PCA. However, this compresses 
(or expands) the fluctuations in all dimensions of the sig¬ 
nal space to unit variance. 
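
The two-variable case described above can be checked numerically. The following is a minimal sketch, not a general PCA routine: the helper name `first_pc_angle` and the sample values are assumptions made for illustration, and the angle is taken from the closed form for the leading eigenvector of a 2 x 2 sample covariance matrix.

```python
import math

def first_pc_angle(xs, ys):
    """Angle in degrees between the first principal component and the x-axis.

    Hypothetical helper: uses the closed form tan(2*theta) = 2*cxy / (cxx - cyy)
    for the major axis of a 2x2 sample covariance matrix.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    cyy = sum((y - my) ** 2 for y in ys) / (n - 1)
    cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return math.degrees(0.5 * math.atan2(2.0 * cxy, cxx - cyy))

# Two variables with the same sample variance and positive correlation.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 3.0, 5.0, 4.0]

# Equal variances: the first component lies at 45 degrees.
angle_raw = first_pc_angle(xs, ys)

# Multiply the first variable by 100: the first component is now almost
# aligned with that variable (angle close to 0 degrees).
angle_scaled = first_pc_angle([100.0 * x for x in xs], ys)
```

Standardizing both variables to unit variance before the analysis would bring the angle back to 45 degrees, which is the motivation for working with the correlation matrix.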

Mean subtraction (a.k.a. “mean centering”) is neces¬ 
sary for performing PCA to ensure that the first principal 
component describes the direction of maximum variance. 
If mean subtraction is not performed, the first principal 
component might instead correspond more or less to the 
mean of the data. A mean of zero is needed for finding 
a basis that minimizes the mean square error of the
approximation of the data.[9]

PCA is equivalent to empirical orthogonal functions 
(EOF), a name which is used in meteorology. 

An autoencoder neural network with a linear hidden layer 
is similar to PCA. Upon convergence, the weight vectors 
of the K neurons in the hidden layer will form a basis for 
the space spanned by the first K principal components. 
Unlike PCA, this technique will not necessarily produce 
orthogonal vectors. 

PCA is a popular primary technique in pattern recognition.
It is not, however, optimized for class separability.[10]
An alternative is linear discriminant analysis, which
does take this into account.


47.4 Table of symbols and abbrevi¬ 
ations 

47.5 Properties and limitations of 
PCA 

47.5.1 Properties [11]

Property 1: For any integer q, 1 ≤ q ≤ p, consider
the orthogonal linear transformation

$y = B'x$

where $y$ is a q-element vector and $B'$ is a (q × p)
matrix, and let $\Sigma_y = B'\Sigma B$ be the
variance-covariance matrix for $y$. Then the trace of
$\Sigma_y$, denoted $\operatorname{tr}(\Sigma_y)$, is
maximized by taking $B = A_q$, where $A_q$ consists of
the first q columns of $A$ ($B'$ is the transposition of $B$).

Property 2: Consider again the orthonormal
transformation

$y = B'x$

with $x$, $B$, $A$ and $\Sigma_y$ defined as before. Then
$\operatorname{tr}(\Sigma_y)$ is minimized by taking
$B = A_q^{*}$, where $A_q^{*}$ consists of the last q
columns of $A$.

The statistical implication of this property is that the last 
few PCs are not simply unstructured left-overs after re¬ 
moving the important PCs. Because these last PCs have 
variances as small as possible they are useful in their own 
right. They can help to detect unsuspected near-constant 
linear relationships between the elements of x, and they 
may also be useful in regression, in selecting a subset of 
variables from x, and in outlier detection. 

Property 3: (Spectral decomposition of $\Sigma$)

$\Sigma = \lambda_1 \alpha_1 \alpha_1' + \cdots + \lambda_p \alpha_p \alpha_p'$

Before we look at its usage, we first look at the diagonal
elements,

$\operatorname{Var}(x_j) = \sum_{k=1}^{p} \lambda_k \alpha_{kj}^2$

Then, perhaps the main statistical implication of the result
is that not only can we decompose the combined variances
of all the elements of x into decreasing contributions
due to each PC, but we can also decompose the
whole covariance matrix into contributions
$\lambda_k \alpha_k \alpha_k'$ from each PC. Although not
strictly decreasing, the elements of
$\lambda_k \alpha_k \alpha_k'$ will tend to become smaller
as k increases, as $\lambda_k$ decreases for increasing k,
whereas the elements of $\alpha_k$ tend to stay about the
same size because of the normalization constraints:
$\alpha_k' \alpha_k = 1$, $k = 1, \ldots, p$.

47.5.2 Limitations

As noted above, the results of PCA depend on the scaling
of the variables. A scale-invariant form of PCA has been
developed.[12]

The applicability of PCA is limited by certain
assumptions[13] made in its derivation.

47.5.3 PCA and information theory 

The claim that PCA used for dimensionality reduction
preserves most of the information of the data is misleading.
Indeed, without any assumption on the signal model,
PCA cannot help to reduce the amount of information
lost during dimensionality reduction, where information
is measured using Shannon entropy.[14]

Under the assumption that 


x = s + n 

i.e., that the data vector x is the sum of the desired
information-bearing signal s and a noise signal n, one can
show that PCA can be optimal for dimensionality reduction
also from an information-theoretic point of view.

In particular, Linsker showed that if s is Gaussian and n
is Gaussian noise with a covariance matrix proportional
to the identity matrix, then PCA maximizes the mutual
information $I(y; s)$ between the desired information s and
the dimensionality-reduced output $y = W_L^T x$.[15]

If the noise is still Gaussian and has a covariance matrix
proportional to the identity matrix (i.e., the components
of the vector n are iid), but the information-bearing signal
s is non-Gaussian (which is a common scenario), PCA at
least minimizes an upper bound on the information loss,
which is defined as[16][17]

$I(x; s) - I(y; s).$

The optimality of PCA is also preserved if the noise n is
iid and at least more Gaussian (in terms of the
Kullback-Leibler divergence) than the information-bearing
signal s.[18] In general, even if the above signal model holds,
PCA loses its information-theoretic optimality as soon as
the noise n becomes dependent.

47.6 Computing PCA using the covariance
method

The following is a detailed description of PCA using the
covariance method, as opposed to the correlation
method.[19] Note, however, that it is better to use the
singular value decomposition (using standard software).

The goal is to transform a given data set X of dimension p
to an alternative data set Y of smaller dimension L.
Equivalently, we are seeking to find the matrix Y, where Y is
the Karhunen-Loève transform (KLT) of matrix X:


Y = KLT{X} 

47.6.1 Organize the data set 

Suppose you have data comprising a set of observations
of p variables, and you want to reduce the data so that
each observation can be described with only L variables,
L < p. Suppose further that the data are arranged as a
set of n data vectors $x_1, \ldots, x_n$, with each $x_i$
representing a single grouped observation of the p variables.

• Write $x_1, \ldots, x_n$ as row vectors, each of which has p
columns.

• Place the row vectors into a single matrix X of
dimensions n × p.

47.6.2 Calculate the empirical mean 

• Find the empirical mean along each dimension
j = 1, ..., p.

• Place the calculated mean values into an empirical
mean vector u of dimensions p × 1.

$u[j] = \frac{1}{n} \sum_{i=1}^{n} X[i, j]$

47.6.3 Calculate the deviations from the 
mean 

Mean subtraction is an integral part of the solution towards
finding a principal component basis that minimizes
the mean square error of approximating the data.[20]
Hence we proceed by centering the data as follows:

• Subtract the empirical mean vector u from each row
of the data matrix X.

• Store mean-subtracted data in the n × p matrix B.

$B = X - h u^T$

where h is an n × 1 column vector of all 1s:

$h[i] = 1 \quad \text{for } i = 1, \ldots, n$

47.6.4 Find the covariance matrix

• Find the p × p empirical covariance matrix C from
the outer product of matrix B with itself:

$C = \frac{1}{n-1} B^{*} B$

where * is the conjugate transpose operator.
Note that if B consists entirely of real numbers,
which is the case in many applications,
the “conjugate transpose” is the same as the
regular transpose.

• Note that outer products apply to vectors. For
tensor cases we should apply tensor products, but the
covariance matrix in PCA is a sum of outer products
between its sample vectors; indeed, it can be
represented as $B^{*} B$.

• The reasoning behind using n - 1 instead of n to
calculate the covariance is Bessel’s correction.

47.6.5 Find the eigenvectors and eigenval¬ 
ues of the covariance matrix 

• Compute the matrix V of eigenvectors which
diagonalizes the covariance matrix C:

$V^{-1} C V = D$

where D is the diagonal matrix of eigenvalues
of C. This step will typically involve the use
of a computer-based algorithm for computing
eigenvectors and eigenvalues. These algorithms
are readily available as sub-components
of most matrix algebra systems, such as R,
MATLAB,[21][22] Mathematica,[23] SciPy, IDL
(Interactive Data Language), or GNU Octave
as well as OpenCV.

• Matrix D will take the form of a p × p diagonal
matrix, where

$D[k, l] = \lambda_k \quad \text{for } k = l$

is the kth eigenvalue of the covariance matrix
C, and

$D[k, l] = 0 \quad \text{for } k \neq l.$

• Matrix V, also of dimension p × p, contains p column
vectors, each of length p, which represent the
p eigenvectors of the covariance matrix C.

• The eigenvalues and eigenvectors are ordered and
paired. The jth eigenvalue corresponds to the jth
eigenvector.

47.6.6 Rearrange the eigenvectors and
eigenvalues

• Sort the columns of the eigenvector matrix V and
eigenvalue matrix D in order of decreasing eigenvalue.

• Make sure to maintain the correct pairings between
the columns in each matrix.

47.6.7 Compute the cumulative energy
content for each eigenvector

• The eigenvalues represent the distribution of the
source data’s energy among each of the eigenvectors,
where the eigenvectors form a basis for the
data. The cumulative energy content g for the jth
eigenvector is the sum of the energy content across
all of the eigenvalues from 1 through j:

$g[j] = \sum_{k=1}^{j} D[k, k] \quad \text{for } j = 1, \ldots, p$

47.6.8 Select a subset of the eigenvectors as
basis vectors

• Save the first L columns of V as the p × L matrix W:

$W[k, l] = V[k, l] \quad \text{for } k = 1, \ldots, p, \quad l = 1, \ldots, L$

where

$1 \le L \le p.$

• Use the vector g as a guide in choosing an appropriate
value for L. The goal is to choose a value of L as
small as possible while achieving a reasonably high
value of g on a percentage basis. For example, you
may want to choose L so that the cumulative energy
g is above a certain threshold, like 90 percent. In
this case, choose the smallest value of L such that

$\frac{g[L]}{g[p]} \ge 0.9$





47.6.9 Convert the source data to z-scores 
(optional) 

• Create a p × 1 empirical standard deviation vector s
from the square root of each element along the main
diagonal of the diagonalized covariance matrix C.
(Note that scaling operations do not commute with
the KLT, thus we must scale by the variances of the
already-decorrelated vector, which is the diagonal of C):

$s = \{s[j]\} = \{\sqrt{C[j, j]}\} \quad \text{for } j = 1, \ldots, p$

• Calculate the n × p z-score matrix:

$Z = \frac{B}{h s^T}$ (divide element-by-element)

• Note: While this step is useful for various applications
as it normalizes the data set with respect to its
variance, it is not an integral part of PCA/KLT.

47.6.10 Project the z-scores of the data 
onto the new basis 

• The projected vectors are the columns of the matrix 


T = Z W = KLT{X}. 

• The rows of matrix T represent the Karhunen- 
Loeve transforms (KLT) of the data vectors in the 
rows of matrix X. 
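
The steps of the covariance method can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than a reference implementation: the function name `pca_covariance` and its `energy` parameter are hypothetical, the optional z-score step is skipped (so the centred matrix B is projected instead of Z), and `numpy.linalg.eigh` is used because C is symmetric.

```python
import numpy as np

def pca_covariance(X, energy=0.9):
    """PCA of an n x p data matrix X via the covariance method (sketch).

    Returns (T, W, eigvals): the projected data, the p x L basis and all
    eigenvalues in decreasing order. The optional z-score step is omitted.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    u = X.mean(axis=0)                     # empirical mean of each variable
    B = X - u                              # deviations from the mean
    C = (B.T @ B) / (n - 1)                # p x p covariance matrix (Bessel's correction)
    eigvals, V = np.linalg.eigh(C)         # eigen-decomposition of symmetric C
    order = np.argsort(eigvals)[::-1]      # sort by decreasing eigenvalue,
    eigvals, V = eigvals[order], V[:, order]   # keeping the pairings intact
    g = np.cumsum(eigvals)                 # cumulative energy content
    # Smallest L such that g[L] / g[p] >= energy.
    L = int(np.searchsorted(g / g[-1], energy)) + 1
    W = V[:, :L]                           # first L eigenvectors as basis vectors
    T = B @ W                              # project the centred data onto the basis
    return T, W, eigvals

# Example data: three variables with very different variances.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([5.0, 1.0, 0.1])
T, W, eigvals = pca_covariance(X)
```

The variance of the first column of T equals the largest eigenvalue, which is one way to sanity-check an implementation. As noted at the start of this section, an SVD of the centred data is usually preferable to forming C explicitly.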

47.7 Derivation of PCA using the 
covariance method 

Let X be a d-dimensional random vector expressed as a column
vector. Without loss of generality, assume X has
zero mean.

We want to find (*) a d × d orthonormal transformation
matrix P so that PX has a diagonal covariance matrix (i.e.
PX is a random vector with all its distinct components
pairwise uncorrelated).

A quick computation assuming P were unitary yields:

$\operatorname{var}(PX) = E[PX \, (PX)^{\dagger}]$
$= E[PX \, X^{\dagger} P^{\dagger}]$
$= P \, E[X X^{\dagger}] \, P^{\dagger}$
$= P \operatorname{var}(X) P^{-1}$

Hence (*) holds if and only if var(X) is diagonalisable
by P.

This is very constructive, as var(X) is guaranteed to be a
non-negative definite matrix and thus is guaranteed to be
diagonalisable by some unitary matrix.
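
The identity var(PX) = P var(X) P^{-1} can be checked numerically for real-valued data. The sketch below assumes a sample estimate of var(X) and takes P as the transpose of the eigenvector matrix returned by `numpy.linalg.eigh`; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-mean samples of a 3-dimensional random vector X, one sample per column.
A = rng.normal(size=(3, 3))
X = A @ rng.normal(size=(3, 1000))
X -= X.mean(axis=1, keepdims=True)

var_X = (X @ X.T) / (X.shape[1] - 1)   # sample estimate of var(X)
_, V = np.linalg.eigh(var_X)           # columns of V: orthonormal eigenvectors
P = V.T                                # orthonormal transformation matrix

var_PX = P @ var_X @ P.T               # equals var(PX), since P^{-1} = P^T here
off_diagonal = var_PX - np.diag(np.diag(var_PX))   # should vanish up to rounding
```

The off-diagonal entries are zero up to rounding error, so the components of PX are pairwise uncorrelated in the sample.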

47.7.1 Iterative computation 

In practical implementations, especially with high-dimensional
data (large p), the covariance method is rarely used
because it is not efficient. One way to compute the first
principal component efficiently[24] is shown in the following
pseudo-code, for a data matrix X with zero mean,
without ever computing its covariance matrix.

    r = a random vector of length p
    do c times:
        s = 0 (a vector of length p)
        for each row x ∈ X:
            s = s + (x · r) x
        r = s / |s|
    return r

This algorithm is simply an efficient way of calculating
$X^T X r$, normalizing, and placing the result back in r
(power iteration). It avoids the $np^2$ operations of calculating
the covariance matrix. r will typically get close to
the first principal component of X within a small number
of iterations, c. (The magnitude of s will be larger after
each iteration. Convergence can be detected when it increases
by an amount too small for the precision of the
machine.)
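
A runnable version of the pseudo-code might look as follows. This is a sketch under assumptions: the function name `first_principal_component`, the fixed iteration count and the seeded random start are illustrative choices, and the inner loop over rows is replaced by the equivalent matrix-vector products.

```python
import numpy as np

def first_principal_component(X, c=100, seed=0):
    """Power iteration for the first principal component of zero-mean X (n x p)."""
    rng = np.random.default_rng(seed)
    r = rng.normal(size=X.shape[1])        # random starting vector of length p
    r /= np.linalg.norm(r)
    for _ in range(c):
        # Sum over rows x of (x . r) x, i.e. X^T (X r), without forming X^T X.
        s = X.T @ (X @ r)
        r = s / np.linalg.norm(s)          # normalize and place the result back in r
    return r

# Example: four variables with well-separated variances.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) * np.array([3.0, 1.0, 0.5, 0.2])
X -= X.mean(axis=0)
r = first_principal_component(X)
```

Each iteration costs on the order of np operations. The sign of r is arbitrary, so comparisons with other implementations should be made up to sign.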

Subsequent principal components can be computed by 
subtracting component r from X (see Gram-Schmidt) 
and then repeating this algorithm to find the next principal 
component. However this simple approach is not numeri¬ 
cally stable if more than a small number of principal com¬ 
ponents are required, because imprecisions in the calcu¬ 
lations will additively affect the estimates of subsequent 
principal components. More advanced methods build on 
this basic idea, as with the closely related Lanczos algo¬ 
rithm. 

One way to compute the eigenvalue that corresponds with 
each principal component is to measure the difference 
in mean-squared-distance between the rows and the cen¬ 
troid, before and after subtracting out the principal com¬ 
ponent. The eigenvalue that corresponds with the com¬ 
ponent that was removed is equal to this difference. 
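
The deflation step and the eigenvalue computation described above can be sketched as follows. The helper names `deflate` and `eigenvalue_from_deflation` are hypothetical, X is assumed to be zero-mean (centroid at the origin), and the sum of squared distances is divided by n - 1 as an assumption so that the result matches the sample covariance matrix used in the covariance method.

```python
import numpy as np

def deflate(X, r):
    """Remove the component along the unit vector r from every row of X."""
    return X - np.outer(X @ r, r)

def eigenvalue_from_deflation(X, r):
    """Eigenvalue for r as the drop in mean squared distance to the centroid."""
    n = X.shape[0]
    before = np.sum(X ** 2) / (n - 1)
    after = np.sum(deflate(X, r) ** 2) / (n - 1)
    return before - after

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3)) * np.array([4.0, 2.0, 1.0])
X -= X.mean(axis=0)

# Any method may supply r; here the leading eigenvector is taken from eigh.
eigvals, V = np.linalg.eigh(X.T @ X / (X.shape[0] - 1))
r = V[:, -1]
lam = eigenvalue_from_deflation(X, r)      # agrees with eigvals[-1] up to rounding
X_next = deflate(X, r)                     # input for finding the next component
```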

47.7.2 The NIPALS method 

Main article: Non-linear iterative partial least squares 

For very-high-dimensional datasets, such as those
generated in the *omics sciences (e.g., genomics,
metabolomics), it is usually only necessary to compute
the first few PCs. The non-linear iterative partial least
squares (NIPALS) algorithm calculates $t_1$ and $w_1^T$ from
X. The outer product $t_1 w_1^T$ can then be subtracted from
X, leaving the residual matrix $E_1$. This can then be used
to calculate subsequent PCs.[25] This results in a dramatic
reduction in computational time since calculation of the
covariance matrix is avoided.

However, for large data matrices, or matrices that have a
high degree of column collinearity, NIPALS suffers from
loss of orthogonality due to machine precision limitations
accumulated in each iteration step.[26] A Gram-Schmidt
(GS) re-orthogonalization algorithm is applied to both the
scores and the loadings at each iteration step to eliminate
this loss of orthogonality.[27]
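
A compact sketch of the NIPALS iteration is given below. It is a simplified illustration, not the published algorithm in full: the function name `nipals`, the starting vector, the stopping rule and the tolerance are assumptions, and Gram-Schmidt re-orthogonalization is applied to the loadings only in order to keep the example short.

```python
import numpy as np

def nipals(X, n_components=2, max_iter=500, tol=1e-10):
    """NIPALS sketch: returns scores T (n x k) and loadings W (p x k).

    X is assumed to be mean-centred.
    """
    E = np.array(X, dtype=float)           # residual matrix, starts as X
    n, p = E.shape
    T = np.zeros((n, n_components))
    W = np.zeros((p, n_components))
    for k in range(n_components):
        t = E[:, np.argmax(E.var(axis=0))].copy()   # start from highest-variance column
        for _ in range(max_iter):
            w = E.T @ t / (t @ t)          # regress the columns of E on the scores
            w -= W[:, :k] @ (W[:, :k].T @ w)        # Gram-Schmidt against earlier loadings
            w /= np.linalg.norm(w)
            t_new = E @ w                  # updated scores
            converged = np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        T[:, k], W[:, k] = t, w
        E -= np.outer(t, w)                # subtract the outer product t w^T
    return T, W

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
X -= X.mean(axis=0)
T, W = nipals(X, n_components=2)
```

Only matrix-vector products with the residual matrix are needed, so the p × p covariance matrix is never formed.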

47.7.3 Online/sequential estimation 

In an “online” or “streaming” situation with data arriving 
piece by piece rather than being stored in a single batch, it 
is useful to make an estimate of the PCA projection that 
can be updated sequentially. This can be done efficiently,
but requires different algorithms.[28]


47.8 PCA and qualitative variables 

In PCA, it is common that we want to introduce qual¬ 
itative variables as supplementary elements. For exam¬ 
ple, many quantitative variables have been measured on 
plants. For these plants, some qualitative variables are 
available as, for example, the species to which the plant 
belongs. These data were subjected to PCA for quantita¬ 
tive variables. When analyzing the results, it is natural to 
connect the principal components to the qualitative vari¬ 
able species. For this, the following results are produced. 

• Identification, on the factorial planes, of the different 
species e.g. using different colors. 

• Representation, on the factorial planes, of the cen¬ 
ters of gravity of plants belonging to the same 
species. 

• For each center of gravity and each axis, p-value to 
judge the significance of the difference between the 
center of gravity and origin. 

These results are what is called introducing a qualitative
variable as a supplementary element. This procedure is
detailed in Husson, Lê & Pagès 2009 and Pagès 2013.
Few software packages offer this option in an “automatic” way.
This is the case for SPAD, which historically, following the
work of Ludovic Lebart, was the first to propose this option,
and for the R package FactoMineR.


47.9 Applications 


47.9.1 Neuroscience 

A variant of principal components analysis is used in 
neuroscience to identify the specific properties of a stimulus
that increase a neuron's probability of generating an
action potential.[29] This technique is known as
spike-triggered covariance analysis. In a typical application an
experimenter presents a white noise process as a stimulus 
(usually either as a sensory input to a test subject, or as 
a current injected directly into the neuron) and records 
a train of action potentials, or spikes, produced by the 
neuron as a result. Presumably, certain features of the 
stimulus make the neuron more likely to spike. In order 
to extract these features, the experimenter calculates the 
covariance matrix of the spike-triggered ensemble, the set 
of all stimuli (defined and discretized over a finite time 
window, typically on the order of 100 ms) that immedi¬ 
ately preceded a spike. The eigenvectors of the differ¬ 
ence between the spike-triggered covariance matrix and 
the covariance matrix of the prior stimulus ensemble (the 
set of all stimuli, defined over the same length time win¬ 
dow) then indicate the directions in the space of stimuli 
along which the variance of the spike-triggered ensemble 
differed the most from that of the prior stimulus ensem¬ 
ble. Specifically, the eigenvectors with the largest posi¬ 
tive eigenvalues correspond to the directions along which 
the variance of the spike-triggered ensemble showed the 
largest positive change compared to the variance of the 
prior. Since these were the directions in which varying 
the stimulus led to a spike, they are often good approxi¬ 
mations of the sought after relevant stimulus features. 

In neuroscience, PCA is also used to discern the identity 
of a neuron from the shape of its action potential. Spike 
sorting is an important procedure because extracellular 
recording techniques often pick up signals from more 
than one neuron. In spike sorting, one first uses PCA to 
reduce the dimensionality of the space of action potential 
waveforms, and then performs clustering analysis to asso¬ 
ciate specific action potentials with individual neurons. 


47.10 Relation between PCA and
k-means clustering

It was asserted in [30][31] that the relaxed solution of
k-means clustering, specified by the cluster indicators, is
given by the PCA (principal component analysis) principal
components, and the PCA subspace spanned by
the principal directions is identical to the cluster centroid
subspace. However, that PCA is a useful relaxation
of k-means clustering was not a new result (see, for
example,[32]), and it is straightforward to uncover
counterexamples to the statement that the cluster centroid
subspace is spanned by the principal directions.[33]

47.11 Relation between PCA and
factor analysis [34]

Principal component analysis creates variables that are 
linear combinations of the original variables. The new 
variables have the property that the variables are all or¬ 
thogonal. The principal components can be used to find 
clusters in a set of data. PCA is a variance-focused ap¬ 
proach seeking to reproduce the total variable variance, 
in which components reflect both common and unique 
variance of the variable. PCA is generally preferred for 
purposes of data reduction (i.e., translating variable space 
into optimal factor space) but not when the goal is to de¬ 
tect the latent construct or factors. 

Factor analysis is similar to principal component anal¬ 
ysis, in that factor analysis also involves linear combi¬ 
nations of variables. Different from PCA, factor anal¬ 
ysis is a correlation-focused approach seeking to repro¬ 
duce the inter-correlations among variables, in which the 
factors “represent the common variance of variables,
excluding unique variance”.[35] Factor analysis is generally
used when the research purpose is detecting data structure
(i.e., latent constructs or factors) or causal modeling.

47.12 Correspondence analysis 

Correspondence analysis (CA) was developed by Jean-Paul
Benzécri[36] and is conceptually similar to PCA, but
scales the data (which should be non-negative) so that
rows and columns are treated equivalently. It is
traditionally applied to contingency tables. CA decomposes
the chi-squared statistic associated with this table into
orthogonal factors.[37] Because CA is a descriptive
technique, it can be applied to tables for which the chi-squared
statistic is appropriate or not. Several variants of CA
are available including detrended correspondence analysis
and canonical correspondence analysis. One special
extension is multiple correspondence analysis, which may
be seen as the counterpart of principal component analysis
for categorical data.[38]

47.13 Generalizations 

47.13.1 Nonlinear generalizations 

Figure: Linear PCA versus nonlinear principal manifold for
visualization of breast cancer microarray data: a) Configuration
of nodes and 2D principal surface in the 3D PCA linear manifold.
The dataset is curved and cannot be mapped adequately on
a 2D principal plane; b) The distribution in the internal 2D
nonlinear principal surface coordinates (ELMap2D) together with
an estimation of the density of points; c) The same as b), but for
the linear 2D PCA manifold (PCA2D). The “basal” breast cancer
subtype is visualized more adequately with ELMap2D and some
features of the distribution become better resolved in comparison
to PCA2D. Principal manifolds are produced by the elastic
maps algorithm. Data are available for public competition.[40]
Software is available for free non-commercial use.[41]
(Legend: Basal, LumA, LumB, ERBB2, Normal, Unclassified,
Group A, Group B, ER-, ER+.)

Most of the modern methods for nonlinear dimensionality
reduction find their theoretical and algorithmic roots
in PCA or K-means. Pearson’s original idea was to take
a straight line (or plane) which will be “the best fit” to a
set of data points. Principal curves and manifolds[42]
give the natural geometric framework for PCA generalization
and extend the geometric interpretation of PCA
by explicitly constructing an embedded manifold for data
approximation, and by encoding using standard geometric
projection onto the manifold, as illustrated in the figure.
See also the elastic map algorithm and principal geodesic
analysis. Another popular generalization is kernel PCA,
which corresponds to PCA performed in a reproducing
kernel Hilbert space associated with a positive definite
kernel.
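
To make the kernel PCA idea concrete, the sketch below performs PCA in the feature space induced by a Gaussian (RBF) kernel, working only with the n × n kernel matrix. The function name `kernel_pca`, the choice of kernel and the `gamma` parameter are assumptions for illustration; only the projections of the training points are returned.

```python
import numpy as np

def kernel_pca(X, n_components=2, gamma=1.0):
    """Kernel PCA sketch with the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                         # centre the data in feature space
    eigvals, vecs = np.linalg.eigh(Kc)     # Kc is symmetric
    idx = np.argsort(eigvals)[::-1][:n_components]
    eigvals, vecs = eigvals[idx], vecs[:, idx]
    # Projection of training point i on component k is sqrt(lambda_k) * v_k[i].
    return vecs * np.sqrt(np.maximum(eigvals, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = kernel_pca(X, n_components=2, gamma=0.5)
```

With a linear kernel k(x, y) = x · y the same procedure reduces to ordinary PCA scores, which is a useful consistency check.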


47.13.2 Multilinear generalizations 

In multilinear subspace learning,[43] PCA is generalized
to multilinear PCA (MPCA) that extracts features di¬ 
rectly from tensor representations. MPCA is solved by 
performing PCA in each mode of the tensor iteratively. 
MPCA has been applied to face recognition, gait recog¬ 
nition, etc. MPCA is further extended to uncorrelated 
MPCA, non-negative MPCA and robust MPCA. 

47.13.3 Higher order 

N-way principal component analysis may be performed
with models such as Tucker decomposition, PARAFAC,
multiple factor analysis, co-inertia analysis, STATIS, and
DISTATIS.

47.13.4 Robustness - weighted PCA

While PCA finds the mathematically optimal method (as
in minimizing the squared error), it is sensitive to outliers
in the data that produce large errors PCA tries to avoid.
It therefore is common practice to remove outliers before
computing PCA. However, in some contexts, outliers
can be difficult to identify. For example, in data mining
algorithms like correlation clustering, the assignment of
points to clusters and outliers is not known beforehand.
A recently proposed generalization of PCA[44] based on
a weighted PCA increases robustness by assigning different
weights to data objects based on their estimated
relevancy.

47.13.5 Robust PCA via Decomposition in 
Low Rank and Sparse Matrices 

Robust principal component analysis (RPCA) is a modification
of the widely used statistical procedure principal
component analysis (PCA) that works well with respect
to grossly corrupted observations.

47.13.6 Sparse PCA 

A particular disadvantage of PCA is that the principal 
components are usually linear combinations of all input 
variables. Sparse PCA overcomes this disadvantage by 
finding linear combinations that contain just a few input 
variables. 


47.14 Software/source code 

• An Open Source Code and Tutorial in MATLAB 
and C++. 

• FactoMineR - Probably the more complete library 
of functions for exploratory data analysis. 

• XLSTAT - Principal Component Analysis is a part of
the XLSTAT core module.[45]

• Mathematica - Implements principal component
analysis with the PrincipalComponents
command[46] using both covariance and correlation
methods.

• DataMelt - A Java free program that implements 
several classes to build PCA analysis and to calcu¬ 
late eccentricity of random distributions. 

• NAG Library - Principal components analysis is
implemented via the g03aa routine (available in both
the Fortran[47] and the C[48] versions of the Library).

• SIMCA - Commercial software package available
to perform PCA analysis.[49]

• CORICO - Commercial software, offers principal 
components analysis coupled with Iconographie des 
correlations. 

• MATLAB Statistics Toolbox - The functions princomp
and pca (R2012b) give the principal components,
while the function pcares gives the residuals
and reconstructed matrix for a low-rank PCA approximation.
An example MATLAB implementation
of PCA is available.[50]

• Oracle Database 12c - Implemented via
DBMS_DATA_MINING.SVDS_SCORING_MODE
by specifying setting value SVDS_SCORING_PCA.[51]

• GNU Octave - Free software computational environment
mostly compatible with MATLAB; the
function princomp[52] gives the principal component.

• R - Free statistical package; the functions
princomp[53] and prcomp[54] can be used for
principal component analysis; prcomp uses singular
value decomposition, which generally gives better
numerical accuracy. Some packages that implement
PCA in R include, but are not limited to: ade4,
vegan, ExPosition, and FactoMineR.[55]

• SAS, PROC FACTOR - Offers principal components
analysis.[56]

• MLPACK - Provides an implementation of principal
component analysis in C++.

• XLMiner - The principal components tab can be
used for principal component analysis.

• Stata - The pca command provides principal components
analysis.[57]

• Cornell Spectrum Imager - Open-source toolset
built on ImageJ, enables PCA analysis for 3D
datacubes.[58]

• imDEV - Free Excel addon to calculate principal
components using R package.[59][60]

• ViSta: The Visual Statistics System - Free software
that provides principal components analysis, simple
and multiple correspondence analysis.[61]

• Spectramap - Software to create a biplot using principal
components analysis, correspondence analysis
or spectral map analysis.[62]

• FinMath - .NET numerical library containing an
implementation of PCA.[63]


• Unscrambler X - Multivariate analysis software enabling
Principal Component Analysis (PCA) with
PCA Projection.[64]

• OpenCV[65]

• NMath - Proprietary numerical library containing
PCA for the .NET Framework.

• IDL - The principal components can be calculated
using the function pcomp.[66]

• Weka - Computes principal components.[67]

• Qlucore - Commercial software for analyzing multivariate
data with instant response using PCA.[68]

• Orange (software) - Supports PCA through its Lin¬ 
ear Projection widget. 

• EIGENSOFT - Provides a version of PCA adapted
for population genetics analysis.[69]

• Partek Genomics Suite - Statistical software able to
perform PCA.[70]

• libpca C++ library - Offers PCA and corresponding 
transformations. 

• Origin - Contains PCA in its Pro version. 

• Scikit-learn - Python library for machine learning
which contains PCA, Probabilistic PCA, Kernel
PCA, Sparse PCA and other techniques in the
decomposition module.[71]

• KNIME[72] - A Java-based node-arranging software
for analysis, in which the nodes called PCA, PCA
Compute, PCA Apply and PCA Inverse make this easy.

• Julia - Supports PCA with the pca function in the
MultivariateStats package.[73]

• Netflix Surus - Provides a Java implementation of 
robust PCA with wrappers for Pig. 

• Insightomics - Run principal component analysis di¬ 
rectly on your browser. 

47.15 See also 

• Correspondence analysis (for contingency tables) 

• Multiple correspondence analysis (for qualitative 
variables) 

• Factor analysis of mixed data (for quantitative and 
qualitative variables) 

• Canonical correlation 

• CUR matrix approximation (can replace low-rank
SVD approximation)

• Detrended correspondence analysis

• Dynamic mode decomposition 

• Eigenface 

• Exploratory factor analysis (Wikiversity) 

• Factorial code 

• Functional principal component analysis 

• Geometric data analysis 

• Independent component analysis 

• Kernel PCA 

• Low-rank approximation 

• Matrix decomposition 

• Non-negative matrix factorization 

• Nonlinear dimensionality reduction 

• Oja’s rule 

• Point distribution model (PCA applied to morphom¬ 
etry and computer vision) 

• Principal component analysis (Wikibooks) 

• Principal component regression 

• Singular spectrum analysis 

• Singular value decomposition 

• Sparse PCA 

• Transform coding 

• Weighted least squares


Chapter 48 

Dimensionality reduction 


For dimensional reduction in physics, see Dimensional 
reduction. 

In machine learning and statistics, dimensionality
reduction or dimension reduction is the process
of reducing the number of random variables under
consideration,[1] and can be divided into feature selection
and feature extraction.[2]


48.1 Feature selection 

Main article: Feature selection 

Feature selection approaches try to find a subset of the 
original variables (also called features or attributes). Two 
strategies are filter (e.g. information gain) and wrapper 
(e.g. search guided by the accuracy) approaches. See also 
combinatorial optimization problems. 

In some cases, data analysis such as regression or 
classification can be done in the reduced space more ac¬ 
curately than in the original space. 

48.2 Feature extraction 

Main article: Feature extraction 

Feature extraction transforms the data in the
high-dimensional space to a space of fewer dimensions. The
data transformation may be linear, as in principal component
analysis (PCA), but many nonlinear dimensionality
reduction techniques also exist.[3][4] For multidimensional
data, tensor representation can be used in dimensionality
reduction through multilinear subspace learning.[5]

The main linear technique for dimensionality reduction,
principal component analysis, performs a linear mapping
of the data to a lower-dimensional space in such a way that
the variance of the data in the low-dimensional
representation is maximized. In practice, the covariance
(and sometimes the correlation) matrix of the data is
constructed and the eigenvectors of this matrix are
computed. The eigenvectors that correspond to the
largest eigenvalues (the principal components) can now
be used to reconstruct a large fraction of the variance of
the original data. Moreover, the first few eigenvectors can
often be interpreted in terms of the large-scale physical
behavior of the system. The original space (with dimension
of the number of points) has been reduced (with data
loss, but hopefully retaining the most important variance)
to the space spanned by a few eigenvectors.
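
The procedure described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the function name `pca_project`, the synthetic data and the choice of the covariance matrix (rather than the correlation matrix) are not from the source text.

```python
import numpy as np

def pca_project(X, n_components):
    """Project the rows of X onto the leading principal components."""
    # Center each feature so that variance is measured about the mean.
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (columns of X).
    cov = np.cov(X_centered, rowvar=False)
    # eigh handles symmetric matrices and returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the eigenvectors belonging to the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    # Fraction of the total variance retained by the kept components.
    explained = eigvals[order].sum() / eigvals.sum()
    return X_centered @ components, components, explained

# Synthetic 3-D data that lies almost on a line (an assumption for the demo).
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t,
                     2.0 * t + 0.01 * rng.normal(size=200),
                     0.01 * rng.normal(size=200)])
Z, components, explained = pca_project(X, n_components=1)
print(Z.shape, round(float(explained), 4))
```

Because the synthetic data is nearly one-dimensional, a single component retains almost all of the variance; on real data the retained fraction is what guides the choice of `n_components`.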

Principal component analysis can be employed in a
nonlinear way by means of the kernel trick. The resulting
technique, called kernel PCA, is capable of constructing
nonlinear mappings that maximize the variance in the data.
Other prominent nonlinear techniques include manifold
learning techniques such as Isomap, locally linear
embedding (LLE), Hessian LLE, Laplacian eigenmaps, and
LTSA. These techniques construct a low-dimensional data
representation using a cost function that retains local
properties of the data, and can be viewed as defining a
graph-based kernel for kernel PCA. More recently,
techniques have been proposed that, instead of defining a
fixed kernel, try to learn the kernel using semidefinite
programming. The most prominent example of such a
technique is maximum variance unfolding (MVU). The central
idea of MVU is to exactly preserve all pairwise distances
between nearest neighbors (in the inner product space),
while maximizing the distances between points that are not
nearest neighbors. A dimensionality reduction technique
that is sometimes used in neuroscience is maximally
informative dimensions, which finds a lower-dimensional
representation of a dataset such that as much information
as possible about the original data is preserved.

An alternative approach to neighborhood preservation is
through the minimization of a cost function that measures
differences between distances in the input and output
spaces. Important examples of such techniques include
classical multidimensional scaling (which is identical to
PCA), Isomap (which uses geodesic distances in the data
space), diffusion maps (which use diffusion distances in
the data space), t-SNE (which minimizes the divergence
between distributions over pairs of points), and
curvilinear component analysis.

A different approach to nonlinear dimensionality reduction
is through the use of autoencoders, a special kind of
feed-forward neural network with a bottleneck hidden
layer.[6] The training of deep encoders is typically
performed using a greedy layer-wise pre-training (e.g.,
using a stack of restricted Boltzmann machines) that is
followed by a finetuning stage based on backpropagation.

48.3 Dimension reduction 

For high-dimensional datasets (i.e. with number of
dimensions more than 10), dimension reduction is usually
performed prior to applying a k-nearest neighbors
algorithm (k-NN) in order to avoid the effects of the curse
of dimensionality.[7]

Feature extraction and dimension reduction can be
combined in one step using principal component analysis
(PCA), linear discriminant analysis (LDA), or
canonical correlation analysis (CCA) techniques as a
pre-processing step followed by clustering by k-NN on
feature vectors in reduced-dimension space. In machine
learning this process is also called low-dimensional
embedding.[8]

For very-high-dimensional datasets (e.g. when performing
similarity search on live video streams, DNA data
or high-dimensional time series) running a fast
approximate k-NN search using locality sensitive hashing,
“random projections”, “sketches”[10] or other
high-dimensional similarity search techniques from the
VLDB toolbox might be the only feasible option.

48.4 See also 

• Nearest neighbor search 

• MinHash 

• Information gain in decision trees 

• Semidefinite embedding 

• Multifactor dimensionality reduction 

• Multilinear subspace learning 

• Multilinear PCA 

• Singular value decomposition 

• Latent semantic analysis 

• Semantic mapping 

• Topological data analysis 

• Locality sensitive hashing 

• Sufficient dimension reduction 

• Data transformation (statistics) 

• Weighted correlation network analysis 


Chapter 49

Greedy algorithm


[Figure: 36 - 20 = 16; 16 - 10 = 6; 6 - 5 = 1; 1 - 1 = 0]

Greedy algorithms determine the minimum number of coins to
give while making change. These are the steps a human would
take to emulate a greedy algorithm to represent 36 cents
using only coins with values {1, 5, 10, 20}. The coin of the
highest value not exceeding the remaining change owed is
the local optimum. (Note that in general the change-making
problem requires dynamic programming or integer programming
to find an optimal solution; however, most currency
systems, including the Euro and US Dollar, are special
cases where the greedy strategy does find an optimal
solution.)
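
The change-making steps in the caption can be written as a short Python sketch. The function name `greedy_change` and the second coin system {1, 3, 4} are illustrative assumptions, chosen to show a case where the greedy choice is not optimal.

```python
def greedy_change(amount, denominations):
    """Return the coins picked by the greedy rule, largest coin first."""
    coins = []
    for coin in sorted(denominations, reverse=True):
        # Take the current coin as long as it does not exceed what is owed.
        while amount >= coin:
            amount -= coin
            coins.append(coin)
    if amount != 0:
        raise ValueError("amount cannot be represented with these coins")
    return coins

print(greedy_change(36, [1, 5, 10, 20]))  # [20, 10, 5, 1]
# With coins {1, 3, 4} greedy uses three coins for 6, although 3 + 3 needs two.
print(greedy_change(6, [1, 3, 4]))        # [4, 1, 1]
```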


A greedy algorithm is an algorithm that follows the
problem solving heuristic of making the locally optimal
choice at each stage[1] with the hope of finding a global
optimum. In many problems, a greedy strategy does not
in general produce an optimal solution, but nonetheless a
greedy heuristic may yield locally optimal solutions that
approximate a globally optimal solution in a reasonable
time.

For example, a greedy strategy for the traveling salesman
problem (which is of a high computational complexity) is
the following heuristic: “At each stage visit an unvisited
city nearest to the current city”. This heuristic need not
find a best solution, but terminates in a reasonable number
of steps; finding an optimal solution typically requires
unreasonably many steps. In mathematical optimization,
greedy algorithms solve combinatorial problems having
the properties of matroids.
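
The nearest-city heuristic quoted above might be sketched as follows. The function names and the four example cities are assumptions made for illustration, and the tour returned is not guaranteed to be the shortest one.

```python
import math

def nearest_neighbor_tour(points, start=0):
    """Visit cities greedily: always go to the closest unvisited city."""
    unvisited = set(range(len(points))) - {start}
    tour = [start]
    while unvisited:
        current = points[tour[-1]]
        # Locally optimal choice: the nearest city not yet visited.
        nxt = min(unvisited, key=lambda i: math.dist(current, points[i]))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour

def tour_length(points, tour):
    """Length of the closed tour, including the return to the start."""
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

cities = [(0, 0), (1, 0), (3, 0), (6, 0)]
tour = nearest_neighbor_tour(cities)
print(tour, tour_length(cities, tour))
```

The loop runs once per city, so the heuristic terminates after a number of steps that is polynomial in the number of cities, in contrast to an exhaustive search over all tours.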


49.1 Specifics 

In general, greedy algorithms have five components: 

1. A candidate set, from which a solution is created 

2. A selection function, which chooses the best candidate
to be added to the solution

3. A feasibility function, that is used to determine if a
candidate can be used to contribute to a solution

4. An objective function, which assigns a value to a
solution, or a partial solution, and

5. A solution function, which will indicate when we 
have discovered a complete solution 


Greedy algorithms produce good solutions on some
mathematical problems, but not on others. Most problems
for which they work will have two properties:

Greedy choice property We can make whatever choice
seems best at the moment and then solve the subproblems
that arise later. The choice made by a
greedy algorithm may depend on choices made so
far, but not on future choices or all the solutions
to the subproblem. It iteratively makes one greedy
choice after another, reducing each given problem
into a smaller one. In other words, a greedy algorithm
never reconsiders its choices. This is the main
difference from dynamic programming, which is
exhaustive and is guaranteed to find the solution.

After every stage, dynamic programming makes decisions
based on all the decisions made in the previous
stage, and may reconsider the previous stage’s
algorithmic path to solution.


Optimal substructure “A problem exhibits optimal 
substructure if an optimal solution to the problem 
contains optimal solutions to the sub-problems.” [2] 

49.1.1 Cases of failure

Examples of how a greedy algorithm may fail to achieve
the optimal solution:

[Figure] Starting at A, a greedy algorithm will find the
local maximum at “m”, oblivious of the global maximum at
“M”.

[Figure] With a goal of reaching the largest sum, at each
step the greedy algorithm will choose what appears to be
the optimal immediate choice, so it will choose 12 instead
of 3 at the second step, and will not reach the best
solution, which contains 99.

For many other problems, greedy algorithms fail to produce
the optimal solution, and may even produce the
unique worst possible solution. One example is the
traveling salesman problem mentioned above: for each
number of cities, there is an assignment of distances
between the cities for which the nearest neighbor heuristic
produces the unique worst possible tour.[3]

49.2 Types 

Greedy algorithms can be characterized as being
'short-sighted', and also as 'non-recoverable'. They are
ideal only for problems which have 'optimal substructure'.
Despite this, for many simple problems (e.g. giving
change), the best suited algorithms are greedy algorithms.
It is important, however, to note that the greedy algorithm
can be used as a selection algorithm to prioritize options
within a search, or branch and bound algorithm. There are
a few variations to the greedy algorithm:

• Pure greedy algorithms 

• Orthogonal greedy algorithms 

• Relaxed greedy algorithms 

49.3 Applications 

Greedy algorithms mostly (but not always) fail to find the
globally optimal solution, because they usually do not
operate exhaustively on all the data. They can make
commitments to certain choices too early which prevent them
from finding the best overall solution later. For example,
all known greedy coloring algorithms for the graph coloring
problem and all other NP-complete problems do not
consistently find optimum solutions. Nevertheless, they
are useful because they are quick to think up and often
give good approximations to the optimum.

If a greedy algorithm can be proven to yield the global
optimum for a given problem class, it typically becomes
the method of choice because it is faster than other
optimization methods like dynamic programming. Examples
of such greedy algorithms are Kruskal’s algorithm and
Prim’s algorithm for finding minimum spanning trees,
and the algorithm for finding optimum Huffman trees.

The theory of matroids, and the more general theory of 
greedoids, provide whole classes of such algorithms. 

Greedy algorithms appear in network routing as well. Using
greedy routing, a message is forwarded to the neighboring
node which is “closest” to the destination. The
notion of a node’s location (and hence “closeness”) may
be determined by its physical location, as in geographic
routing used by ad hoc networks. Location may also be
an entirely artificial construct as in small world routing
and distributed hash tables.

49.4 Examples 

• The activity selection problem is characteristic of
this class of problems, where the goal is to pick the
maximum number of activities that do not clash with
each other.

• In the Macintosh computer game Crystal Quest the
objective is to collect crystals, in a fashion similar
to the travelling salesman problem. The game has
a demo mode, where the game uses a greedy algorithm
to go to every crystal. The artificial intelligence
does not account for obstacles, so the demo
mode often ends quickly.

• Matching pursuit is an example of a greedy algorithm
applied to signal approximation.

• A greedy algorithm finds the optimal solution to 
Malfatti’s problem of finding three disjoint circles 
within a given triangle that maximize the total area 
of the circles; it is conjectured that the same greedy 
algorithm is optimal for any number of circles. 

• A greedy algorithm is used to construct a Huffman
tree during Huffman coding, where it finds an optimal
solution.

• In decision tree learning, greedy algorithms are
commonly used; however, they are not guaranteed
to find the optimal solution.


49.8 External links 

• Hazewinkel, Michiel, ed. (2001), “Greedy algorithm”,
Encyclopedia of Mathematics, Springer,
ISBN 978-1-55608-010-4

• Greedy algorithm visualization A visualization of a 
greedy solution to the N-Queens puzzle by Yuval 
Baror. 

• Python greedy coin example by Noah Gift. 

• Java implementation used in a Checkers Game 


49.5 See also 

• Epsilon-greedy strategy 

• Greedy algorithm for Egyptian fractions 

• Greedy source 

• Matroid 


49.6 Notes 

[1] Black, Paul E. (2 February 2005). “greedy algorithm”.
Dictionary of Algorithms and Data Structures. U.S. National
Institute of Standards and Technology (NIST). Retrieved
17 August 2012.

[2] Introduction to Algorithms (Cormen, Leiserson, Rivest, 
and Stein) 2001, Chapter 16 “Greedy Algorithms”. 

[3] (G. Gutin, A. Yeo and A. Zverovich, 2002) 


Chapter 50

Reinforcement learning 


For reinforcement learning in psychology, see 
Reinforcement. 

Reinforcement learning is an area of machine learning
inspired by behaviorist psychology, concerned with
how software agents ought to take actions in an environment
so as to maximize some notion of cumulative reward.
The problem, due to its generality, is studied in
many other disciplines, such as game theory, control
theory, operations research, information theory,
simulation-based optimization, multi-agent systems, swarm
intelligence, statistics, and genetic algorithms. In the
operations research and control literature, the field where
reinforcement learning methods are studied is called
approximate dynamic programming. The problem has been
studied in the theory of optimal control, though most
studies are concerned with the existence of optimal
solutions and their characterization, and not with the
learning or approximation aspects. In economics and game
theory, reinforcement learning may be used to explain how
equilibrium may arise under bounded rationality.

In machine learning, the environment is typically
formulated as a Markov decision process (MDP), as many
reinforcement learning algorithms for this context utilize
dynamic programming techniques. The main difference
between the classical techniques and reinforcement learning
algorithms is that the latter do not need knowledge
about the MDP and they target large MDPs where exact
methods become infeasible.

Reinforcement learning differs from standard supervised
learning in that correct input/output pairs are never
presented, nor sub-optimal actions explicitly corrected.
Further, there is a focus on on-line performance, which
involves finding a balance between exploration (of
uncharted territory) and exploitation (of current
knowledge). The exploration vs. exploitation trade-off in
reinforcement learning has been most thoroughly studied
through the multi-armed bandit problem and in finite
MDPs.


50.1 Introduction 

The basic reinforcement learning model consists of: 

1. a set of environment states $S$;

2. a set of actions $A$;

3. rules of transitioning between states; 

4. rules that determine the scalar immediate reward of 
a transition; and 

5. rules that describe what the agent observes. 

The rules are often stochastic. The observation typically 
involves the scalar immediate reward associated with the 
last transition. In many works, the agent is also assumed 
to observe the current environmental state, in which case 
we talk about full observability, whereas in the opposing 
case we talk about partial observability. Sometimes the 
set of actions available to the agent is restricted (e.g., you 
cannot spend more money than what you possess). 

A reinforcement learning agent interacts with its
environment in discrete time steps. At each time $t$, the
agent receives an observation $o_t$, which typically
includes the reward $r_t$. It then chooses an action $a_t$
from the set of actions available, which is subsequently
sent to the environment. The environment moves to a new
state $s_{t+1}$ and the reward $r_{t+1}$ associated with
the transition $(s_t, a_t, s_{t+1})$ is determined. The
goal of a reinforcement learning agent is to collect as
much reward as possible. The agent can choose any action
as a function of the history and it can
even randomize its action selection.
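
The interaction loop described above can be illustrated with a small Python sketch. Everything in it is hypothetical: the two-state environment `TwoStateEnv`, its reward rule and the two example policies are made up for the illustration, and full observability is assumed.

```python
import random

class TwoStateEnv:
    """Hypothetical toy environment: states {0, 1}, actions {0, 1}."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # In this toy model the chosen action becomes the next state.
        next_state = action
        # Reward 1 for staying in state 1, otherwise 0.
        reward = 1.0 if (self.state == 1 and next_state == 1) else 0.0
        self.state = next_state
        return next_state, reward

def run(env, policy, steps, rng):
    """Interact for a fixed number of time steps; return the total reward."""
    observation = env.state  # the agent observes the state directly
    total_reward = 0.0
    for _ in range(steps):
        action = policy(observation, rng)       # choose a_t
        observation, reward = env.step(action)  # receive s_{t+1} and r_{t+1}
        total_reward += reward
    return total_reward

def always_one(observation, rng):
    return 1

def uniform_random(observation, rng):
    # A randomized policy, which the framework explicitly allows.
    return rng.randrange(2)

print(run(TwoStateEnv(), always_one, steps=10, rng=random.Random(0)))  # 9.0
```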

When the agent’s performance is compared to that of an
agent which acts optimally from the beginning, the
difference in performance gives rise to the notion of
regret. Note that in order to act near optimally, the agent
must reason about the long term consequences of its
actions: in order to maximize my future income I had better
go to school now, although the immediate monetary reward
associated with this might be negative.

Thus, reinforcement learning is particularly well suited to
problems which include a long-term versus short-term
reward trade-off. It has been applied successfully to
various problems, including robot control, elevator
scheduling, telecommunications, backgammon and checkers
(Sutton and Barto 1998, Chapter 11).

Two components make reinforcement learning powerful:
the use of samples to optimize performance and the
use of function approximation to deal with large
environments. Thanks to these two key components,
reinforcement learning can be used in large environments
in any of the following situations:

• A model of the environment is known, but an analytic
solution is not available;

• Only a simulation model of the environment is given
(the subject of simulation-based optimization);[1]

• The only way to collect information about the
environment is by interacting with it.

The first two of these problems could be considered
planning problems (since some form of the model is
available), while the last one could be considered as a
genuine learning problem. However, under a reinforcement
learning methodology both planning problems would be
converted to machine learning problems.


50.2 Exploration 

The reinforcement learning problem as described requires
clever exploration mechanisms. Randomly selecting
actions, without reference to an estimated probability
distribution, is known to give rise to very poor
performance. The case of (small) finite MDPs is relatively
well understood by now. However, due to the lack of
algorithms that would provably scale well with the number
of states (or scale to problems with infinite state
spaces), in practice people resort to simple exploration
methods. One such method is $\epsilon$-greedy, in which
the agent chooses the action that it believes has the best
long-term effect with probability $1 - \epsilon$, and
chooses an action uniformly at random otherwise. Here,
$0 < \epsilon < 1$ is a tuning parameter, which is
sometimes changed, either according to a fixed schedule
(making the agent explore less as time goes by), or
adaptively based on some heuristics (Tokic & Palm, 2011).
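
A minimal sketch of the greedy-with-random-exploration rule just described, assuming the agent keeps its current action-value estimates in a plain list; the function name `epsilon_greedy` and the example values are illustrative.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action index given estimated action values."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(q_values))
    # Exploit: choose the action currently believed to be best.
    return max(range(len(q_values)), key=q_values.__getitem__)

rng = random.Random(0)
picks = [epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1, rng=rng)
         for _ in range(1000)]
print(picks.count(1) / len(picks))  # fraction of greedy picks, close to 0.93
```

A decaying schedule can be obtained by passing a smaller `epsilon` as time goes by; how fast to decay it is a tuning decision that this sketch leaves open.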

50.3 Algorithms for control learning

Even if the issue of exploration is disregarded and even 
if the state was observable (which we assume from now 
on), the problem remains to find out which actions are 
good based on past experience. 


50.3.1 Criterion of optimality 

For simplicity, assume for a moment that the problem
studied is episodic, an episode ending when some terminal
state is reached. Assume further that no matter
what course of actions the agent takes, termination is
inevitable. Under some additional mild regularity
conditions the expectation of the total reward is then
well-defined, for any policy and any initial distribution
over the states. Here, a policy refers to a mapping that
assigns some probability distribution over the actions to
all possible histories.

Given a fixed initial distribution $\mu$, we can thus
assign the expected return $\rho^\pi$ to policy $\pi$:

$$\rho^\pi = E[R \mid \pi],$$

where the random variable $R$ denotes the return and is
defined by

$$R = \sum_{t=0}^{N-1} r_{t+1},$$

where $r_{t+1}$ is the reward received after the $t$-th
transition, the initial state is sampled at random from
$\mu$ and actions are selected by policy $\pi$. Here, $N$
denotes the (random) time when a terminal state is
reached, i.e., the time when the episode terminates.

In the case of non-episodic problems the return is often
discounted,

$$R = \sum_{t=0}^{\infty} \gamma^t r_{t+1},$$

giving rise to the total expected discounted reward
criterion. Here $0 \le \gamma \le 1$ is the so-called
discount factor. Since the undiscounted return is a
special case of the discounted return (obtained with
$\gamma = 1$), from now on we will assume discounting.
Although this looks innocent enough, discounting is in
fact problematic if one cares about online performance.
This is because discounting makes the initial time steps
more important. Since a learning agent is likely to make
mistakes during the first few steps after its “life”
starts, no uninformed learning algorithm can achieve
near-optimal performance under discounting even if the
class of environments is restricted to that of finite
MDPs. (This does not mean though that, given enough time,
a learning agent cannot figure out how to act
near-optimally if time were restarted.)
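
For a finite reward sequence, the discounted return can be computed as in the following sketch. The function name is an assumption, and `rewards[t]` stands for the reward received after the $t$-th transition.

```python
def discounted_return(rewards, gamma):
    """Compute sum over t of gamma**t * rewards[t] for a finite sequence."""
    total = 0.0
    # Accumulate from the last reward backwards (Horner's scheme).
    for r in reversed(rewards):
        total = r + gamma * total
    return total

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
print(discounted_return([1.0, 2.0, 3.0], 1.0))  # 6.0, the undiscounted sum
```

With `gamma = 0.5` the third reward contributes only a quarter of its value, which shows how discounting weights the initial time steps more heavily.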

The problem then is to specify an algorithm that can be
used to find a policy with maximum expected return.
From the theory of MDPs it is known that, without loss
of generality, the search can be restricted to the set of
the so-called stationary policies. A policy is called
stationary if the action-distribution returned by it
depends only on the last state visited (which is part of
the observation history of the agent, by our simplifying
assumption). In fact, the search can be further restricted
to deterministic stationary policies. A deterministic
stationary policy is one which deterministically selects
actions based on the current state. Since any such policy
can be identified with a mapping from the set of states to
the set of actions, these policies can be identified with
such mappings with no loss of generality.

50.3.2 Brute force 

The brute force approach entails the following two steps: 

1. For each possible policy, sample returns while
following it

2. Choose the policy with the largest expected return

One problem with this is that the number of policies can
be extremely large, or even infinite. Another is that the
variance of the returns might be large, in which case a
large number of samples will be required to accurately
estimate the return of each policy.

These problems can be ameliorated if we assume some 
structure and perhaps allow samples generated from one 
policy to influence the estimates made for another. The 
two main approaches for achieving this are value function 
estimation and direct policy search. 

50.3.3 Value function approaches 

Value function approaches attempt to find a policy that
maximizes the return by maintaining a set of estimates
of expected returns for some policy (usually either the
“current” or the optimal one).

These methods rely on the theory of MDPs, where
optimality is defined in a sense which is stronger than the
above one: a policy is called optimal if it achieves the
best expected return from any initial state (i.e., initial
distributions play no role in this definition). Again, one
can always find an optimal policy amongst stationary
policies.

To define optimality in a formal manner, define the value
of a policy $\pi$ by

$$V^\pi(s) = E[R \mid s, \pi],$$

where $R$ stands for the random return associated with
following $\pi$ from the initial state $s$. Define
$V^*(s)$ as the maximum possible value of $V^\pi(s)$,
where $\pi$ is allowed to change:

$$V^*(s) = \sup_\pi V^\pi(s).$$

A policy which achieves these optimal values in each state
is called optimal. Clearly, a policy optimal in this strong
sense is also optimal in the sense that it maximizes the
expected return $\rho^\pi$, since
$\rho^\pi = E[V^\pi(S)]$, where $S$ is a state randomly
sampled from the distribution $\mu$.

Although state-values suffice to define optimality, it will
prove to be useful to define action-values. Given a state
$s$, an action $a$ and a policy $\pi$, the action-value of
the pair $(s, a)$ under $\pi$ is defined by

$$Q^\pi(s, a) = E[R \mid s, a, \pi],$$

where, now, $R$ stands for the random return associated
with first taking action $a$ in state $s$ and following
$\pi$ thereafter.

It is well-known from the theory of MDPs that if someone
gives us $Q$ for an optimal policy, we can always choose
optimal actions (and thus act optimally) by simply
choosing the action with the highest value at each state.
The action-value function of such an optimal policy is
called the optimal action-value function and is denoted by
$Q^*$. In summary, the knowledge of the optimal
action-value function alone suffices to know how to act
optimally.

Assuming full knowledge of the MDP, there are two basic
approaches to compute the optimal action-value function:
value iteration and policy iteration. Both algorithms
compute a sequence of functions $Q_k$
($k = 0, 1, 2, \ldots$) which converge to $Q^*$. Computing
these functions involves computing expectations over the
whole state-space, which is impractical for all but the
smallest (finite) MDPs, never mind the case when the MDP
is unknown. In reinforcement learning methods the
expectations are approximated by averaging over samples
and one uses function approximation techniques to cope
with the need to represent value functions over large
state-action spaces.
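
As an illustration of the value iteration approach for a fully known, small MDP, the sketch below repeatedly applies the Bellman optimality backup to a table of action values. The array layout (`P[s, a, s']` for transition probabilities, `R[s, a]` for expected immediate rewards), the fixed iteration count and the two-state example are assumptions made for this sketch.

```python
import numpy as np

def q_value_iteration(P, R, gamma, iterations=500):
    """Compute a sequence Q_0, Q_1, ... and return the last iterate."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iterations):
        V = Q.max(axis=1)        # value of acting greedily in each state
        Q = R + gamma * (P @ V)  # expectation over next states, then backup
    return Q

# Toy MDP: action 0 stays in the current state, action 1 switches state.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 1] = P[1, 1, 0] = 1.0
# Reward 1 only for staying in state 1.
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])
Q = q_value_iteration(P, R, gamma=0.9)
print(np.round(Q, 3))        # approximately [[8.1, 9.0], [10.0, 8.1]]
print(Q.argmax(axis=1))      # greedy actions: switch in state 0, stay in state 1
```

The product `P @ V` is the expectation over the whole state space mentioned above, which is why this exact computation is only practical for small finite MDPs.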

Monte Carlo methods 

The simplest Monte Carlo methods can be used in an
algorithm that mimics policy iteration. Policy iteration
consists of two steps: policy evaluation and policy
improvement. The Monte Carlo methods are used in the
policy evaluation step. In this step, given a stationary,
deterministic policy $\pi$, the goal is to compute the
function values $Q^\pi(s, a)$ (or a good approximation to
them) for all state-action pairs $(s, a)$. Assume (for
simplicity) that the MDP is finite and in fact a table
representing the action-values fits into the memory.
Further, assume that the problem is episodic and after
each episode a new one starts from some random initial
state. Then, the estimate of the value of a given
state-action pair $(s, a)$ can be computed by simply
averaging the sampled returns which originated from
$(s, a)$ over time. Given enough time, this procedure can
thus construct a precise estimate $Q$ of the action-value
function $Q^\pi$. This finishes the description of the
policy evaluation step. In the policy improvement step, as
it is done in the standard policy iteration algorithm, the
next policy is obtained by computing a greedy policy with
respect to $Q$: given a state $s$, this new policy returns
an action that maximizes $Q(s, \cdot)$. In practice one
often avoids computing and storing the new policy, but
uses lazy evaluation to defer the computation of the
maximizing actions to when they are actually needed.
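
The policy evaluation step described above amounts to averaging sampled returns per state-action pair. A minimal sketch follows, assuming each episode is recorded as a list of `(state, action, reward)` triples and that only the pair that starts an episode is credited; the function name and the sample episodes are made up for illustration.

```python
from collections import defaultdict

def mc_action_values(episodes, gamma=1.0):
    """Average the returns of episodes, keyed by their starting pair."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        # Return of the whole episode, accumulated from the end.
        g = 0.0
        for _, _, reward in reversed(episode):
            g = reward + gamma * g
        state, action, _ = episode[0]
        totals[(state, action)] += g
        counts[(state, action)] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}

def greedy_action(Q, state, actions):
    """Policy improvement, evaluated lazily for one state at a time."""
    return max(actions, key=lambda a: Q.get((state, a), float("-inf")))

episodes = [
    [("s0", "a", 0.0), ("s1", "b", 1.0)],
    [("s0", "a", 0.0), ("s1", "b", 3.0)],
    [("s0", "b", 1.0)],
]
Q = mc_action_values(episodes)
print(Q)
print(greedy_action(Q, "s0", ["a", "b"]))
```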

A few problems with this procedure are as follows: 

• The procedure may waste too much time on evaluating
a suboptimal policy;

• It uses samples inefficiently in that a long trajectory
is used to improve the estimate only of the single
state-action pair that started the trajectory;

• When the returns along the trajectories have high
variance, convergence will be slow;

• It works in episodic problems only;

• It works in small, finite MDPs only.

Temporal difference methods 

The first issue is easily corrected by allowing the
procedure to change the policy (at all, or at some states)
before the values settle. However good this sounds, this
may be dangerous as this might prevent convergence. Still,
most current algorithms implement this idea, giving rise to
the class of generalized policy iteration algorithms. We
note in passing that actor-critic methods belong to this
category.

The second issue can be corrected within the algorithm
by allowing trajectories to contribute to any state-action
pair in them. This may also help to some extent with
the third problem, although a better solution when returns
have high variance is to use Sutton's temporal difference
(TD) methods which are based on the recursive Bellman
equation. Note that the computation in TD methods can
be incremental (when after each transition the memory
is changed and the transition is thrown away), or batch
(when the transitions are collected and then the estimates
are computed once based on a large number of transitions).
Batch methods, a prime example of which is the
least-squares temporal difference method due to Bradtke
and Barto (1996), may use the information in the samples
better, whereas incremental methods are the only choice
when batch methods become infeasible due to their high
computational or memory complexity. In addition, there
exist methods that try to unify the advantages of the two
approaches. Methods based on temporal differences also
overcome the fourth issue, the restriction to episodic
problems.

In order to address the last issue mentioned in the
previous section, function approximation methods are used.
In linear function approximation one starts with a mapping
$\phi$ that assigns a finite-dimensional vector to each
state-action pair. Then, the action values of a
state-action pair $(s, a)$ are obtained by linearly
combining the components of $\phi(s, a)$ with some weights
$\theta$:

$$Q(s, a) = \sum_{i=1}^{d} \theta_i \phi_i(s, a).$$

The algorithms then adjust the weights, instead of
adjusting the values associated with the individual
state-action pairs. However, linear function approximation
is not the only choice. More recently, methods based on
ideas from nonparametric statistics (which can be seen to
construct their own features) have been explored.
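
The linear form of the action-value estimate can be written down directly. In the sketch below the feature map `one_hot_features` is a hypothetical choice (one indicator per state-action pair), under which the linear approximation reduces to an ordinary table; other feature maps would let several pairs share weights.

```python
import numpy as np

def one_hot_features(s, a, n_states=2, n_actions=2):
    """Hypothetical feature map: one indicator per (state, action) pair."""
    vec = np.zeros(n_states * n_actions)
    vec[s * n_actions + a] = 1.0
    return vec

def q_linear(theta, phi, s, a):
    """Q(s, a) as the inner product of the weights and the feature vector."""
    return float(np.dot(theta, phi(s, a)))

# Example weights, chosen by hand for the illustration.
theta = np.array([8.1, 9.0, 10.0, 8.1])
print(q_linear(theta, one_hot_features, 1, 0))  # 10.0
```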

So far, the discussion was restricted to how policy
iteration can be used as a basis for designing
reinforcement learning algorithms. Equally importantly,
value iteration can also be used as a starting point,
giving rise to the Q-Learning algorithm (Watkins 1989) and
its many variants.
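
A tabular Q-learning update can be sketched as follows. This is an illustrative sketch rather than a reference implementation: the step size `alpha`, the epsilon-greedy exploration, the fixed number of steps and the deterministic two-state environment `toy_step` are all assumptions.

```python
import random

def q_learning(step, n_states, n_actions, gamma=0.9, alpha=0.1,
               epsilon=0.1, steps=20000, seed=0):
    """Learn action values from sampled transitions only."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(n_actions)                       # explore
        else:
            a = max(range(n_actions), key=Q[s].__getitem__)    # exploit
        s_next, r = step(s, a)
        # Sampled version of the value iteration backup.
        target = r + gamma * max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next
    return Q

def toy_step(s, a):
    """Hypothetical environment: action 0 stays, action 1 switches state."""
    s_next = s if a == 0 else 1 - s
    r = 1.0 if (s == 1 and a == 0) else 0.0  # reward for staying in state 1
    return s_next, r

Q = q_learning(toy_step, n_states=2, n_actions=2)
print([[round(q, 2) for q in row] for row in Q])
```

Unlike exact value iteration, the update uses a single sampled transition instead of an expectation over next states, so no model of the environment is needed.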

The problem with methods that use action-values is that
they may need highly precise estimates of the competing
action values, which can be hard to obtain when the
returns are noisy. Though this problem is mitigated to some
extent by temporal difference methods and if one uses
the so-called compatible function approximation method,
more work remains to be done to increase generality and
efficiency. Another problem specific to temporal
difference methods comes from their reliance on the
recursive Bellman equation. Most temporal difference
methods have a so-called $\lambda$ parameter
($0 \le \lambda \le 1$) that allows one to continuously
interpolate between Monte Carlo methods (which do not rely
on the Bellman equations) and the basic temporal
difference methods (which rely entirely on the Bellman
equations), which can thus be effective in palliating this
issue.

50.3.4 Direct policy search 

An alternative method to find a good policy is to search
directly in (some subset of) the policy space, in which
case the problem becomes an instance of stochastic
optimization. The two approaches available are
gradient-based and gradient-free methods.

Gradient-based methods (giving rise to the so-called pol¬ 
icy gradient methods) start with a mapping from a finite¬ 
dimensional (parameter) space to the space of policies: 
given the parameter vector 0 , let ng denote the policy 
associated to 6 . Define the performance function by 

p{6)= Pl¬ 
under mild conditions this function will be differentiable 
as a function of the parameter vector 9 . If the gradient 


50.5. CURRENT RESEARCH 


331 


of p was known, one could use gradient ascent. Since an 
analytic expression for the gradient is not available, one 
must rely on a noisy estimate. Such an estimate can be 
constructed in many ways, giving rise to algorithms like 
Williams’ REINFORCE method (which is also known as 
the likelihood ratio method in the simulation-based op¬ 
timization literature). Policy gradient methods have re¬ 
ceived a lot of attention in the last couple of years (e.g., 
Peters et al. (2003)), but they remain an active field. 
An overview of policy search methods in the context of
robotics has been given by Deisenroth, Neumann and
Peters.[12] The issue with many of these methods is that
they may get stuck in local optima (as they are based on
local search).
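
A minimal sketch of a likelihood-ratio (REINFORCE-style) gradient estimate is given below for a two-armed bandit with a softmax policy. The bandit, its reward probabilities, the learning rate and the function names are assumptions chosen for illustration; no baseline is used, so the estimate is noisy.

```python
import math
import random

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_bandit(mean_rewards=(0.2, 0.8), steps=5000, lr=0.1, seed=0):
    """Gradient ascent on expected reward using noisy sampled gradients."""
    rng = random.Random(seed)
    theta = [0.0] * len(mean_rewards)
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs)[0]
        # Bernoulli reward: a noisy sample of the arm's mean reward.
        r = 1.0 if rng.random() < mean_rewards[a] else 0.0
        for k in range(len(theta)):
            # Gradient of log pi_theta(a) for a softmax policy.
            grad_log = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += lr * r * grad_log
    return softmax(theta)

final_policy = reinforce_bandit()
print(final_policy)
```

Each update multiplies the sampled reward by the gradient of the log-probability of the chosen action, which is an unbiased but high-variance estimate of the gradient of the performance function.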

A large class of methods avoids relying on gradient in¬ 
formation. These include simulated annealing, cross¬ 
entropy search or methods of evolutionary computation. 
Many gradient-free methods can achieve (in theory and 
in the limit) a global optimum. In a number of cases they 
have indeed demonstrated remarkable performance. 

The issue with policy search methods is that they may 
converge slowly if the information based on which they 
act is noisy. For example, this happens when in episodic 
problems the trajectories are long and the variance of the 
returns is large. As argued beforehand, value-function 
based methods that rely on temporal differences might 
help in this case. In recent years, several actor-critic al¬ 
gorithms have been proposed following this idea and were 
demonstrated to perform well in various problems. 


50.4 Theory 


The theory for small, finite MDPs is quite mature. Both 
the asymptotic and finite-sample behavior of most algo¬ 
rithms is well-understood. As mentioned beforehand, 
algorithms with provably good online performance (ad¬ 
dressing the exploration issue) are known. The the¬ 
ory of large MDPs needs more work. Efficient explo¬ 
ration is largely untouched (except for the case of ban¬ 
dit problems). Although finite-time performance bounds 
appeared for many algorithms in the recent years, these 
bounds are expected to be rather loose and thus more 
work is needed to better understand the relative advan¬ 
tages, as well as the limitations of these algorithms. 
For incremental algorithms, asymptotic convergence issues
have been settled. Recently, new incremental, temporal- 
difference-based algorithms have appeared which con¬ 
verge under a much wider set of conditions than was pre¬ 
viously possible (for example, when used with arbitrary, 
smooth function approximation). 


50.5 Current research 

Current research topics include: adaptive methods which 
work with fewer (or no) parameters under a large number 
of conditions, addressing the exploration problem in large 
MDPs, large-scale empirical evaluations, learning and 
acting under partial information (e.g., using Predictive 
State Representation), modular and hierarchical rein¬ 
forcement learning, improving existing value-function 
and policy search methods, algorithms that work well 
with large (or continuous) action spaces, transfer learn¬ 
ing, lifelong learning, efficient sample-based planning 
(e.g., based on Monte-Carlo tree search). Multiagent or
distributed reinforcement learning is also a topic of interest
in current research. There is also a growing interest
in real-life applications of reinforcement learning.

Reinforcement learning algorithms such as TD learning
are also being investigated as a model for dopamine-based
learning in the brain. In this model, the dopaminergic
projections from the substantia nigra to the basal
ganglia function as the prediction error. Reinforcement
learning has also been used as a part of the model for 
human skill learning, especially in relation to the inter¬ 
action between implicit and explicit learning in skill ac¬ 
quisition (the first publication on this application was 
in 1995-1996, and there have been many follow-up 
studies). See http://webdocs.cs.ualberta.ca/~sutton/RL-FAQ.html#behaviorism
for further details of these research areas.


50.6 Literature 

50.6.1 Conferences, journals 

Most reinforcement learning papers are published at the 
major machine learning and AI conferences (ICML, 
NIPS, AAAI, IJCAI, UAI, AI and Statistics) and jour¬ 
nals (JAIR, JMLR, Machine learning journal, IEEE T- 
CIAIG). Some theory papers are published at COLT 
and ALT. However, many papers appear in robotics 
conferences (IROS, ICRA) and the “agent” conference 
AAMAS. Operations researchers publish their papers
at the INFORMS conference and, for example, in the
journals Operations Research and Mathematics of Operations
Research. Control researchers publish their papers
at the CDC and ACC conferences, or, e.g., in the
journals IEEE Transactions on Automatic Control, or 
Automatica, although applied works tend to be published 
in more specialized journals. The Winter Simulation 
Conference also publishes many relevant papers. Other
than this, papers are also published in the major conferences
of the neural networks, fuzzy, and evolutionary
computation communities. The annual IEEE symposium
titled Approximate Dynamic Programming and Reinforcement
Learning (ADPRL) and the biannual European
Workshop on Reinforcement Learning (EWRL) are
two regularly held meetings where RL researchers meet.

50.7 See also 

• Temporal difference learning 

• Q-learning 

• SARSA 

• Fictitious play 

• Learning classifier system 

• Optimal control 

• Dynamic treatment regimes 

• Error-driven learning 

• Multi-agent system 

• Distributed artificial intelligence 

50.8 Implementations 

• RL-Glue provides a standard interface that allows 
you to connect agents, environments, and experi¬ 
ment programs together, even if they are written in 
different languages. 

• Maja Machine Learning Framework (MMLF) is a general
framework for problems in the domain of Reinforcement
Learning (RL), written in Python.

• Software Tools for Reinforcement Learning (Matlab 
and Python) 

• PyBrain (Python)

• TeachingBox is a Java reinforcement learning 
framework supporting many features like RBF net¬ 
works, gradient descent learning methods, ... 

• C++ and Python implementations for some well 
known reinforcement learning algorithms with 
source. 

• Orange, a free data mining software suite, module 
orngReinforcement 

• Policy Gradient Toolbox provides a package for 
learning about policy gradient approaches. 

• BURLAP is an open source Java library that pro¬ 
vides a wide range of single and multi-agent learning 
and planning methods. 

• Hybrid reinforcement learning 

• Piqle: a Generic Java Platform for Reinforcement
Learning Algorithms

• Reinforcement Learning applied to Tic-Tac-Toe

• Scholarpedia Reinforcement Learning

• Scholarpedia Temporal Difference Learning

• Stanford Reinforcement Learning Course

• Real-world reinforcement learning experiments at
Delft University of Technology

• Reinforcement Learning Tools for Matlab

• Stanford University Andrew Ng Lecture on Reinforcement
Learning

• Website for Reinforcement Learning: An Introduction
[...] Press, including a link to an html version of the
book.

• Reinforcement Learning Repository

• Reinforcement Learning and Artificial Intelligence
(RLAI, Rich Sutton’s lab at the University of Alberta)

• Autonomous Learning Laboratory (ALL, Andrew
Barto’s lab at the University of Massachusetts
Amherst)


Chapter 51 

Decision tree learning 


This article is about decision trees in machine learning. 
For the use of the term in decision analysis, see Decision 
tree. 




Decision tree learning uses a decision tree as a 
predictive model which maps observations about an item 
to conclusions about the item’s target value. It is one 
of the predictive modelling approaches used in statistics, 
data mining and machine learning. Tree models where 
the target variable can take a finite set of values are called 
classification trees. In these tree structures, leaves rep¬ 
resent class labels and branches represent conjunctions 
of features that lead to those class labels. Decision trees 
where the target variable can take continuous values (typ¬ 
ically real numbers) are called regression trees. 




In decision analysis, a decision tree can be used to visu¬ 
ally and explicitly represent decisions and decision mak¬ 
ing. In data mining, a decision tree describes data but not 
decisions; rather the resulting classification tree can be an 
input for decision making. This page deals with decision 
trees in data mining. 


A tree showing survival of passengers on the Titanic (“sibsp” is 
the number of spouses or siblings aboard). The figures under 
the leaves show the probability of survival and the percentage of 
observations in the leaf. 


51.1 General 

Decision tree learning is a method commonly used in data
mining.[1] The goal is to create a model that predicts the
value of a target variable based on several input variables.
An example is shown in the figure above. Each interior node
corresponds to one of the input variables; there are edges to
children for each of the possible values of that input vari¬ 
able. Each leaf represents a value of the target variable 
given the values of the input variables represented by the 
path from the root to the leaf. 

A decision tree is a simple representation for classifying 
examples. Decision tree learning is one of the most suc¬ 
cessful techniques for supervised classification learning. 
For this section, assume that all of the features have fi¬ 
nite discrete domains, and there is a single target feature 
called the classification. Each element of the domain of 
the classification is called a class. A decision tree or a
classification tree is a tree in which each internal (non-leaf)
node is labeled with an input feature. The arcs coming
from a node labeled with a feature are labeled with
each of the possible values of the feature. Each leaf of 
the tree is labeled with a class or a probability distribu¬ 
tion over the classes. 

A tree can be “learned” by splitting the source set into 
subsets based on an attribute value test. This process is 
repeated on each derived subset in a recursive manner 
called recursive partitioning. The recursion is completed 
when the subset at a node has all the same value of the 
target variable, or when splitting no longer adds value to 
the predictions. This process of top-down induction of
decision trees (TDIDT)[2] is an example of a greedy algorithm,
and it is by far the most common strategy for
learning decision trees from data.
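
The recursive partitioning described above can be sketched as follows for features with finite discrete domains. This is a minimal illustration, not any particular published algorithm: the function names, the toy data and the choice of Gini impurity as the splitting score are assumptions, and recursion here simply stops when a subset is pure or no features remain.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, features):
    """Return a class label (leaf) or a (feature, children) tuple."""
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]

    def split_impurity(f):
        # Weighted impurity of the subsets produced by splitting on f.
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[f], []).append(y)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())

    # Greedy step: pick the locally best feature, then recurse.
    best = min(features, key=split_impurity)
    remaining = [f for f in features if f != best]
    children = {}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        children[value] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx], remaining)
    return (best, children)

def predict(tree, row):
    # Follow the arcs matching the row's feature values down to a leaf.
    while isinstance(tree, tuple):
        feature, children = tree
        tree = children[row[feature]]
    return tree

rows = [{"outlook": "sunny", "windy": "no"},
        {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rainy", "windy": "no"},
        {"outlook": "rainy", "windy": "yes"}]
labels = ["play", "play", "play", "stay"]
tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
```

Note that `predict` raises a `KeyError` for a feature value never seen during training; a fuller implementation would need a fallback such as the majority class at that node.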

In data mining, decision trees can be described also as the 
combination of mathematical and computational tech¬ 
niques to aid the description, categorisation and gener¬ 
alisation of a given set of data. 

Data comes in records of the form:

$(\mathbf{x}, Y) = (x_1, x_2, x_3, \ldots, x_k, Y)$

The dependent variable, Y, is the target variable that we
are trying to understand, classify or generalize. The vector
$\mathbf{x}$ is composed of the input variables, $x_1$, $x_2$, $x_3$ etc.,
that are used for that task.


51.2 Types 

Decision trees used in data mining are of two main types: 

• Classification tree analysis is when the predicted 
outcome is the class to which the data belongs. 

• Regression tree analysis is when the predicted out¬ 
come can be considered a real number (e.g. the price 
of a house, or a patient’s length of stay in a hospital). 

The term Classification And Regression Tree (CART)
analysis is an umbrella term used to refer to both of the
above procedures, first introduced by Breiman et al.[3]
Trees used for regression and trees used for classification
have some similarities, but also some differences, such
as the procedure used to determine where to split.[3]

Some techniques, often called ensemble methods, con¬ 
struct more than one decision tree: 

• Bagging decision trees, an early ensemble method,
builds multiple decision trees by repeatedly resampling
training data with replacement, and voting the
trees for a consensus prediction.[4]

• A Random Forest classifier uses a number of decision
trees, in order to improve the classification rate.

• Boosted Trees can be used for regression-type and
classification-type problems.[5][6]

• Rotation forest, in which every decision tree
is trained by first applying principal component
analysis (PCA) on a random subset of the input
features.[7]

Decision tree learning is the construction of a decision 
tree from class-labeled training tuples. A decision tree is 
a flow-chart-like structure, where each internal (non-leaf) 
node denotes a test on an attribute, each branch represents 
the outcome of a test, and each leaf (or terminal) node 
holds a class label. The topmost node in a tree is the root 
node. 

There are many specific decision-tree algorithms. No¬ 
table ones include: 

• ID3 (Iterative Dichotomiser 3)

• C4.5 (successor of ID3)

• CART (Classification And Regression Tree)

• CHAID (CHi-squared Automatic Interaction Detector).
Performs multi-level splits when computing
classification trees.[8]

• MARS: extends decision trees to handle numerical
data better.

• Conditional Inference Trees. Statistics-based approach
that uses non-parametric tests as splitting criteria,
corrected for multiple testing to avoid overfitting.
This approach results in unbiased predictor
selection and does not require pruning.[9][10]

ID3 and CART were invented independently at around
the same time (between 1970 and 1980), yet follow a
similar approach for learning decision trees from training
tuples.


51.3 Metrics 

Algorithms for constructing decision trees usually work
top-down, by choosing a variable at each step that best
splits the set of items.[11] Different algorithms use different
metrics for measuring “best”. These generally measure
the homogeneity of the target variable within the
subsets. Some examples are given below. These metrics
are applied to each candidate subset, and the resulting values
are combined (e.g., averaged) to provide a measure of
the quality of the split.


51.3.1 Gini impurity 

Not to be confused with Gini coefficient. 

Used by the CART (classification and regression tree) al¬ 
gorithm, Gini impurity is a measure of how often a ran¬ 
domly chosen element from the set would be incorrectly 
labeled if it were randomly labeled according to the dis¬ 
tribution of labels in the subset. Gini impurity can be 
computed by summing the probability of each item being 
chosen times the probability of a mistake in categorizing 
that item. It reaches its minimum (zero) when all cases 
in the node fall into a single target category. 

To compute Gini impurity for a set of items, suppose $i \in \{1, 2, \ldots, m\}$,
and let $f_i$ be the fraction of items labeled
with value $i$ in the set.

$I_G(f) = \sum_{i=1}^{m} f_i (1 - f_i) = \sum_{i=1}^{m} (f_i - f_i^2) = \sum_{i=1}^{m} f_i - \sum_{i=1}^{m} f_i^2 = 1 - \sum_{i=1}^{m} f_i^2$
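
As a small worked illustration, Gini impurity can be computed directly from label counts. The function name `gini_impurity` and the sample label lists are assumptions for this sketch.

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared label fractions."""
    n = len(labels)
    fractions = [count / n for count in Counter(labels).values()]
    return 1.0 - sum(f * f for f in fractions)

pure = gini_impurity(["a", "a", "a", "a"])      # single category
mixed = gini_impurity(["a", "a", "b", "b"])     # two equally likely labels
print(pure, mixed)
```

A node whose items all share one label scores zero, and the score grows as the labels become more evenly mixed.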

51.3.2 Information gain

Main article: Information gain in decision trees

Used by the ID3, C4.5 and C5.0 tree-generation algorithms.
Information gain is based on the concept of
entropy from information theory.

$I_E(f) = -\sum_{i=1}^{m} f_i \log_2 f_i$

51.3.3 Variance reduction

Introduced in CART,[3] variance reduction is often employed
in cases where the target variable is continuous
(regression tree), meaning that use of many other metrics
would first require discretization before being applied.
The variance reduction of a node N is defined as the total
reduction of the variance of the target variable x due to
the split at this node:

$I_V(N) = \frac{1}{|S|^2} \sum_{i \in S} \sum_{j \in S} \frac{1}{2}(x_i - x_j)^2 - \left( \frac{1}{|S_t|^2} \sum_{i \in S_t} \sum_{j \in S_t} \frac{1}{2}(x_i - x_j)^2 + \frac{1}{|S_f|^2} \sum_{i \in S_f} \sum_{j \in S_f} \frac{1}{2}(x_i - x_j)^2 \right)$

where $S$, $S_t$, and $S_f$ are the set of presplit sample indices,
the set of sample indices for which the split test is true,
and the set of sample indices for which the split test is false,
respectively. Each of the above summands is indeed a
variance estimate, though written in a form without directly
referring to the mean.


51.4 Decision tree advantages

Amongst other data mining methods, decision trees have
various advantages:

• Simple to understand and interpret. People are
able to understand decision tree models after a brief
explanation.

• Requires little data preparation. Other techniques
often require data normalisation, the creation of
dummy variables and the removal of blank values.

• Able to handle both numerical and categorical
data. Other techniques are usually specialised in
analysing datasets that have only one type of variable.
(For example, relation rules can be used only
with nominal variables while neural networks can be
used only with numerical variables.)

• Uses a white box model. If a given situation is observable
in a model, the explanation for the condition
is easily expressed in boolean logic. (An example
of a black box model is an artificial neural network,
since the explanation for the results is difficult to
understand.)

• Possible to validate a model using statistical
tests. That makes it possible to account for the reliability
of the model.


• Robust. Performs well even if its assumptions are 
somewhat violated by the true model from which the 
data were generated. 

• Performs well with large datasets. Large amounts 
of data can be analysed using standard computing 
resources in reasonable time. 


51.5 Limitations 




• The problem of learning an optimal decision tree
is known to be NP-complete under several aspects
of optimality and even for simple concepts.[12][13]
Consequently, practical decision-tree learning algorithms
are based on heuristics such as the greedy algorithm,
where locally-optimal decisions are made at
each node. Such algorithms cannot guarantee to return
the globally-optimal decision tree. To reduce
the greedy effect of local-optimality some methods
such as the dual information distance (DID) tree
were proposed.[14]

• Decision-tree learners can create over-complex
trees that do not generalise well from the training
data. (This is known as overfitting.[15]) Mechanisms
such as pruning are necessary to avoid this problem
(with the exception of some algorithms such as the
Conditional Inference approach, which does not require
pruning[9][10]).

• There are concepts that are hard to learn because
decision trees do not express them easily, such
as XOR, parity or multiplexer problems. In such
cases, the decision tree becomes prohibitively large.
Approaches to solve the problem involve either
changing the representation of the problem domain
(known as propositionalisation)[16] or using learning
algorithms based on more expressive representations
(such as statistical relational learning or
inductive logic programming).

• For data including categorical variables with different
numbers of levels, information gain in decision
trees is biased in favor of those attributes with
more levels.[17] However, the issue of biased predictor
selection is avoided by the Conditional Inference
approach.[9]


51.6 Extensions 

51.6.1 Decision graphs

In a decision tree, all paths from the root node to the leaf
node proceed by way of conjunction, or AND. In a decision
graph, it is possible to use disjunctions (ORs) to
join two or more paths together using Minimum message
length (MML).[18] Decision graphs have been further extended
to allow for previously unstated new attributes to
be learnt dynamically and used at different places within
the graph.[19] The more general coding scheme results in
better predictive accuracy and log-loss probabilistic scoring.
In general, decision graphs infer models with fewer
leaves than decision trees.

51.6.2 Alternative search methods 

Evolutionary algorithms have been used to avoid local optimal
decisions and search the decision tree space with
little a priori bias.[20][21]

It is also possible for a tree to be sampled using
MCMC.[22]

The tree can be searched for in a bottom-up fashion.[23]

51.7 See also 

• Decision tree pruning 

• Binary decision diagram 

• CHAID 

• CART 

• ID3 algorithm

• C4.5 algorithm 

• Decision stump 

• Incremental decision tree 

• Alternating decision tree 

• Structured data analysis (statistics) 

51.8 Implementations 

Many data mining software packages provide implementations
of one or more decision tree algorithms. Several
examples include Salford Systems CART (which licensed
the proprietary code of the original CART authors[3]),
IBM SPSS Modeler, RapidMiner, SAS Enterprise Miner,
Matlab, R (an open source software environment for statistical
computing which includes several CART implementations
such as the rpart, party and randomForest packages),
Weka (a free and open-source data mining suite, which
contains many decision tree algorithms), Orange (a free
data mining software suite, which includes the tree module
orngTree), KNIME, Microsoft SQL Server, and
scikit-learn (a free and open-source machine learning library
for the Python programming language).

51.10 External links

• Building Decision Trees in Python From O'Reilly. 

• An Addendum to “Building Decision Trees in 
Python” From O'Reilly. 

• Decision Trees Tutorial using Microsoft Excel. 

• Decision Trees page at aitopics.org, a page with 
commented links. 

• Decision tree implementation in Ruby (AMR) 

• Evolutionary Learning of Decision Trees in C++ 

• Java implementation of Decision Trees based on In¬ 
formation Gain 

• A very explicit explanation of information gain as 
splitting criterion 


Chapter 52 

Information gain in decision trees 


For other uses, see Gain (disambiguation). 

In information theory and machine learning, informa¬ 
tion gain is a synonym for Kullback-Leibler divergence. 
However, in the context of decision trees, the term is 
sometimes used synonymously with mutual information, 
which is the expectation value of the Kullback-Leibler 
divergence of a conditional probability distribution. 

In particular, the information gain about a random variable
X obtained from an observation that a random variable
A takes the value A=a is the Kullback-Leibler divergence
$D_{KL}(p(x|a) \,\|\, p(x|I))$ of the prior distribution $p(x|I)$
for x from the posterior distribution $p(x|a)$ for x given
a.

The expected value of the information gain is the mutual 
information I(X;A) of X and A - i.e. the reduction in the 
entropy of X achieved by learning the state of the random 
variable A. 

In machine learning, this concept can be used to define 
a preferred sequence of attributes to investigate to most 
rapidly narrow down the state of X. Such a sequence 
(which depends on the outcome of the investigation of 
previous attributes at each stage) is called a decision tree. 
Usually an attribute with high mutual information should 
be preferred to other attributes. 

52.1 General definition 

In general terms, the expected information gain is the 
change in information entropy H from a prior state to 
a state that takes some information as given: 

$IG(T, a) = H(T) - H(T|a)$

52.2 Formal definition 

Let T denote a set of training examples, each of the form
$(\mathbf{x}, y) = (x_1, x_2, x_3, \ldots, x_k, y)$ where $x_a \in vals(a)$ is
the value of the a-th attribute of example $\mathbf{x}$ and y is the
corresponding class label. The information gain for an
attribute a is defined in terms of entropy H() as follows:

$IG(T, a) = H(T) - \sum_{v \in vals(a)} \frac{|\{\mathbf{x} \in T \mid x_a = v\}|}{|T|} \cdot H(\{\mathbf{x} \in T \mid x_a = v\})$

The mutual information is equal to the total entropy for 
an attribute if for each of the attribute values a unique 
classification can be made for the result attribute. In this 
case, the relative entropies subtracted from the total en¬ 
tropy are 0. 
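
The formal definition can be illustrated with a short sketch that computes the entropy of the class labels and subtracts the weighted entropies of the subsets induced by one attribute. The function names and the toy examples below are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """H(T) minus the size-weighted entropy of each attribute-value subset."""
    n = len(labels)
    subsets = {}
    for x, y in zip(examples, labels):
        subsets.setdefault(x[attribute], []).append(y)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

examples = [{"a": 0, "b": 0}, {"a": 0, "b": 1},
            {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = ["n", "n", "y", "y"]
gain_a = information_gain(examples, labels, "a")  # a determines the class
gain_b = information_gain(examples, labels, "b")  # b carries no information
print(gain_a, gain_b)
```

Here attribute `a` yields a unique classification for each of its values, so its gain equals the total entropy of the labels, while attribute `b` leaves the class distribution unchanged in every subset.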

52.3 Drawbacks 

Although information gain is usually a good measure for 
deciding the relevance of an attribute, it is not perfect. A 
notable problem occurs when information gain is applied 
to attributes that can take on a large number of distinct 
values. For example, suppose that one is building a deci¬ 
sion tree for some data describing the customers of a busi¬ 
ness. Information gain is often used to decide which of 
the attributes are the most relevant, so they can be tested 
near the root of the tree. One of the input attributes might 
be the customer’s credit card number. This attribute has 
a high mutual information, because it uniquely identifies 
each customer, but we do not want to include it in the 
decision tree: deciding how to treat a customer based on 
their credit card number is unlikely to generalize to cus¬ 
tomers we haven't seen before (overfitting). 

Information gain ratio is sometimes used instead. This biases
the decision tree against considering attributes with a
large number of distinct values. However, attributes with
very low information values then appear to receive an
unfair advantage.
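
The effect can be seen in a small sketch that compares information gain with gain ratio, taken here to be information gain divided by the entropy of the attribute's own value distribution (the usual C4.5-style definition, stated as an assumption since the text does not give a formula). The customer data and attribute names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(examples, labels, attribute):
    n = len(labels)
    subsets = {}
    for x, y in zip(examples, labels):
        subsets.setdefault(x[attribute], []).append(y)
    return entropy(labels) - sum(len(s) / n * entropy(s)
                                 for s in subsets.values())

def gain_ratio(examples, labels, attribute):
    # Split information: entropy of the attribute's value distribution.
    split_info = entropy([x[attribute] for x in examples])
    if split_info == 0:
        return 0.0
    return information_gain(examples, labels, attribute) / split_info

members = [1, 1, 1, 1, 0, 0, 0, 1]
examples = [{"card_number": i, "member": m} for i, m in enumerate(members)]
labels = ["y", "y", "y", "y", "n", "n", "n", "n"]

ig_card = information_gain(examples, labels, "card_number")
ig_member = information_gain(examples, labels, "member")
gr_card = gain_ratio(examples, labels, "card_number")
gr_member = gain_ratio(examples, labels, "member")
print(ig_card, ig_member, gr_card, gr_member)
```

The unique identifier wins on plain information gain, but dividing by the split information penalises its many distinct values, so the coarser `member` attribute ranks higher under gain ratio.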

52.4 References 

• Mitchell, Tom M. (1997). Machine Learning. The
McGraw-Hill Companies, Inc. ISBN 0070428077.


Chapter 53

Ensemble learning 


For an alternative meaning, see variational Bayesian 
methods. 

In statistics and machine learning, ensemble methods
use multiple learning algorithms to obtain better
predictive performance than could be obtained from any
of the constituent learning algorithms.[1][2][3] Unlike a
statistical ensemble in statistical mechanics, which is usually
infinite, a machine learning ensemble refers only to
a concrete finite set of alternative models, but typically
allows for much more flexible structure to exist among
those alternatives.


53.1 Overview 

Supervised learning algorithms are commonly described 
as performing the task of searching through a hypoth¬ 
esis space to find a suitable hypothesis that will make 
good predictions with a particular problem. Even if the 
hypothesis space contains hypotheses that are very well- 
suited for a particular problem, it may be very difficult to 
find a good one. Ensembles combine multiple hypotheses 
to form a (hopefully) better hypothesis. The term ensem¬ 
ble is usually reserved for methods that generate multi¬ 
ple hypotheses using the same base learner. The broader 
term of multiple classifier systems also covers hybridiza¬ 
tion of hypotheses that are not induced by the same base 
learner. 

Evaluating the prediction of an ensemble typically re¬ 
quires more computation than evaluating the prediction 
of a single model, so ensembles may be thought of as a 
way to compensate for poor learning algorithms by performing
a lot of extra computation. Fast algorithms such
as decision trees are commonly used with ensembles (for
example Random Forest), although slower algorithms can
benefit from ensemble techniques as well.


53.2 Ensemble theory 

An ensemble is itself a supervised learning algorithm, because
it can be trained and then used to make predictions.
The trained ensemble, therefore, represents a single
hypothesis. This hypothesis, however, is not necessarily
contained within the hypothesis space of the models
from which it is built. Thus, ensembles can be shown to 
have more flexibility in the functions they can represent. 
This flexibility can, in theory, enable them to over-fit the 
training data more than a single model would, but in prac¬ 
tice, some ensemble techniques (especially bagging) tend 
to reduce problems related to over-fitting of the training 
data. 

Empirically, ensembles tend to yield better results when
there is a significant diversity among the models.[4][5]
Many ensemble methods, therefore, seek to promote diversity
among the models they combine.[6][7] Although
perhaps non-intuitive, more random algorithms (like random
decision trees) can be used to produce a stronger
ensemble than very deliberate algorithms (like entropy-reducing
decision trees).[8] Using a variety of strong
learning algorithms, however, has been shown to be more
effective than using techniques that attempt to dumb down
the models in order to promote diversity.[9]

53.3 Common types of ensembles 

53.3.1 Bayes optimal classifier 

The Bayes Optimal Classifier is a classification technique.
It is an ensemble of all the hypotheses in the hypothesis
space. On average, no other ensemble can outperform
it.[10] Each hypothesis is given a vote proportional to
the likelihood that the training dataset would be sampled
from a system if that hypothesis were true. To facilitate
training data of finite size, the vote of each hypothesis is
also multiplied by the prior probability of that hypothesis.
The Bayes Optimal Classifier can be expressed with the
following equation:

$y = \underset{c_j \in C}{\mathrm{argmax}} \sum_{h_i \in H} P(c_j|h_i) P(T|h_i) P(h_i)$

where y is the predicted class, C is the set of all possible
classes, H is the hypothesis space, P refers to a probability,
and T is the training data. As an ensemble, the Bayes
Optimal Classifier represents a hypothesis that is not necessarily
in H. The hypothesis represented by the Bayes
Optimal Classifier, however, is the optimal hypothesis in
ensemble space (the space of all possible ensembles consisting
only of hypotheses in H).
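
For a hypothesis space small enough to enumerate, the equation can be evaluated directly. The sketch below uses three made-up hypotheses with hand-picked likelihoods and priors; all names and numbers are assumptions for illustration only.

```python
def bayes_optimal_class(classes, hypotheses):
    """Pick the class with the largest weighted vote.

    hypotheses: list of (class_probs, likelihood, prior) tuples, where
    class_probs[c] is P(c | h), likelihood is P(T | h), prior is P(h).
    """
    def score(c):
        return sum(probs[c] * likelihood * prior
                   for probs, likelihood, prior in hypotheses)
    return max(classes, key=score)

classes = ["+", "-"]
hypotheses = [
    ({"+": 1.0, "-": 0.0}, 0.04, 1 / 3),  # most likely single hypothesis
    ({"+": 0.0, "-": 1.0}, 0.03, 1 / 3),
    ({"+": 0.0, "-": 1.0}, 0.03, 1 / 3),
]
prediction = bayes_optimal_class(classes, hypotheses)
print(prediction)
```

In this toy case the single most probable hypothesis votes for "+", but the combined vote of the other two outweighs it, which shows how the ensemble can represent a decision that no individual hypothesis in H makes on its own strength.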

Unfortunately, the Bayes Optimal Classifier cannot be practically
implemented for any but the simplest of problems.
There are several reasons why the Bayes Optimal
Classifier cannot be practically implemented:

1. Most interesting hypothesis spaces are too large to
iterate over, as required by the argmax.

2. Many hypotheses yield only a predicted class, rather
than a probability for each class as required by the
term $P(c_j|h_i)$.

3. Computing an unbiased estimate of the probability
of the training set given a hypothesis ($P(T|h_i)$) is
non-trivial.

4. Estimating the prior probability for each hypothesis
($P(h_i)$) is rarely feasible.

53.3.2 Bootstrap aggregating (bagging) 

Main article: Bootstrap aggregating 

Bootstrap aggregating, often abbreviated as bagging, involves
having each model in the ensemble vote with
equal weight. In order to promote model variance, bagging
trains each model in the ensemble using a randomly
drawn subset of the training set. As an example,
the random forest algorithm combines random decision
trees with bagging to achieve very high classification
accuracy.[11] Bagging has also been applied to
unsupervised learning.[12][13]
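
A minimal sketch of bagging is shown below, assuming bootstrap samples drawn with replacement and a very simple base learner (a one-dimensional threshold rule). The base learner, function names and data are assumptions for illustration; any classifier could take the place of `train_stump`.

```python
import random
from collections import Counter

def train_stump(xs, ys):
    """Hypothetical base learner: best single threshold on a 1-D feature."""
    best = None
    for t in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y
                         for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right

def bagging(xs, ys, n_models=25, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: n indices drawn with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(train_stump([xs[i] for i in idx],
                                  [ys[i] for i in idx]))

    def predict(x):
        # Every model votes with equal weight.
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]
    return predict

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
predict = bagging(xs, ys)
print(predict(1), predict(8))
```

Because each model sees a different resample, the individual thresholds differ, and the equal-weight vote averages out that variation.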

53.3.3 Boosting 

Main article: Boosting (meta-algorithm) 

Boosting involves incrementally building an ensemble by 
training each new model instance to emphasize the train¬ 
ing instances that previous models mis-classified. In some 
cases, boosting has been shown to yield better accuracy 
than bagging, but it also tends to be more likely to over-fit 
the training data. By far, the most common implementation
of Boosting is Adaboost, although some newer algorithms
are reported to achieve better results.
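
The reweighting idea can be sketched with a compact AdaBoost-style loop over threshold rules on a one-dimensional feature. This is an illustrative sketch under the standard exponential reweighting scheme; the helper names, the toy data and the choice of three rounds are assumptions.

```python
import math

def train_weighted_stump(xs, ys, w):
    """Labels are -1 or +1. Returns (weighted_error, predictor)."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if (sign if x > t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    err, t, sign = best
    return err, (lambda x: sign if x > t else -sign)

def adaboost(xs, ys, rounds=10):
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, h = train_weighted_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # Raise the weight of misclassified points, lower the rest.
        w = [wi * math.exp(-alpha * y * h(x)) for x, y, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]

    def predict(x):
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return predict

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]   # no single threshold separates these labels
predict = adaboost(xs, ys, rounds=3)
print([predict(x) for x in xs])
```

Each round fits the next rule to the reweighted data, so points that earlier rules got wrong dominate the next fit; the final prediction is a weighted vote of all rules.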

53.3.4 Bayesian model averaging 

Bayesian model averaging (BMA) is an ensemble technique
that seeks to approximate the Bayes Optimal Classifier
by sampling hypotheses from the hypothesis space
and combining them using Bayes’ law.[14] Unlike the
Bayes Optimal Classifier, Bayesian model averaging can be
practically implemented. Hypotheses are typically sampled
using a Monte Carlo sampling technique such as
MCMC. For example, Gibbs sampling may be used to
draw hypotheses that are representative of the distribution
$P(T \mid H)$. It has been shown that under certain
circumstances, when hypotheses are drawn in this manner
and averaged according to Bayes’ law, this technique has
an expected error that is bounded to be at most twice the
expected error of the Bayes Optimal Classifier.[15] Despite
the theoretical correctness of this technique, it has been
found to promote overfitting and to perform worse,
empirically, than simpler ensemble techniques such
as bagging;[16] however, these conclusions appear to be
based on a misunderstanding of the purpose of Bayesian
model averaging vs. model combination.[17]


53.3.5 Bayesian model combination 

Bayesian model combination (BMC) is an algorithmic
correction to BMA. Instead of sampling each model in
the ensemble individually, it samples from the space of
possible ensembles (with model weightings drawn randomly
from a Dirichlet distribution having uniform parameters).
This modification overcomes the tendency of
BMA to converge toward giving all of the weight to a
single model. Although BMC is somewhat more computationally
expensive than BMA, it tends to yield dramatically
better results. The results from BMC have been
shown to be better on average (with statistical significance)
than BMA and bagging.[18]

The use of Bayes’ law to compute model weights necessitates
computing the probability of the data given each
model. Typically, none of the models in the ensemble are
exactly the distribution from which the training data were
generated, so all of them correctly receive a value close
to zero for this term. This would work well if the ensemble
were big enough to sample the entire model-space,
but such is rarely possible. Consequently, each pattern in
the training data will cause the ensemble weight to shift
toward the model in the ensemble that is closest to the
distribution of the training data. BMA thus essentially
reduces to an unnecessarily complex method for doing
model selection.

The possible weightings for an ensemble can be visualized
as lying on a simplex. At each vertex of the simplex, all
of the weight is given to a single model in the ensemble.
BMA converges toward the vertex that is closest to the
distribution of the training data. By contrast, BMC converges
toward the point where this distribution projects
onto the simplex. In other words, instead of selecting the
one model that is closest to the generating distribution,
it seeks the combination of models that is closest to the
generating distribution.
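
The tendency of BMA weights to collapse onto a single vertex of the simplex can be illustrated with a small sketch. It is not taken from the cited papers: the two Bernoulli models, the uniform prior, and the toy data are assumptions chosen only to make the effect visible, and `bma_weights` is a hypothetical helper name.

```python
import math

def bma_weights(models, data):
    """Posterior weight of each model under a uniform prior (Bayes' law).

    Each model is a hypothetical Bernoulli distribution, described by its
    probability of producing a 1.
    """
    log_liks = [
        sum(math.log(p if x == 1 else 1.0 - p) for x in data)
        for p in models
    ]
    # Subtract the largest log-likelihood before exponentiating to avoid underflow.
    top = max(log_liks)
    unnormalized = [math.exp(ll - top) for ll in log_liks]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

models = [0.3, 0.6]      # neither model matches the data exactly
few = [1, 0]             # two observations
many = [1, 0] * 100      # two hundred observations, half of them ones

w_few = bma_weights(models, few)
w_many = bma_weights(models, many)
```

With two observations the weights stay close to an even split; with two hundred, nearly all of the weight sits on the model closer to the data, even though a mixture of the two models would describe the data better than either alone.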

The results from BMA can often be approximated by using
cross-validation to select the best model from a bucket
of models. Likewise, the results from BMC may be approximated
by using cross-validation to select the best ensemble
combination from a random sampling of possible
weightings.

53.3.6 Bucket of models 

A “bucket of models” is an ensemble in which a model
selection algorithm is used to choose the best model for
each problem. When tested with only one problem, a
bucket of models can produce no better results than the
best model in the set, but when evaluated across many
problems, it will typically produce much better results,
on average, than any model in the set.

The most common approach used for model selection is
cross-validation selection (sometimes called a “bake-off
contest”). It is described with the following pseudo-code:

For each model m in the bucket:
    Do c times (where 'c' is some constant):
        Randomly divide the training dataset into two datasets: A and B.
        Train m with A.
        Test m with B.
Select the model that obtains the highest average score.

Cross-Validation Selection can be summed up as: “try 
them all with the training set, and pick the one that works 
best”. [19] 
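
The pseudo-code above can be sketched in Python as follows. The two toy regressors, the half-and-half split, and the negative mean squared error score are illustrative assumptions, not part of the original description; any learner with `fit` and `predict` methods could be placed in the bucket.

```python
import random

class MeanModel:
    """Predicts the mean of the training targets (a deliberately weak model)."""
    def fit(self, xs, ys):
        self.mean = sum(ys) / len(ys)
    def predict(self, x):
        return self.mean

class LineModel:
    """Ordinary least-squares line in one input variable."""
    def fit(self, xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        self.slope = sxy / sxx if sxx > 0 else 0.0
        self.intercept = my - self.slope * mx
    def predict(self, x):
        return self.intercept + self.slope * x

def score(model, xs, ys):
    """Negative mean squared error, so that higher is better."""
    return -sum((model.predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_validation_select(bucket, xs, ys, c=10, seed=0):
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    averages = []
    for make_model in bucket:                 # for each model m in the bucket
        total = 0.0
        for _ in range(c):                    # do c times
            rng.shuffle(indices)              # randomly divide into A and B
            half = len(indices) // 2
            a, b = indices[:half], indices[half:]
            model = make_model()
            model.fit([xs[i] for i in a], [ys[i] for i in a])            # train with A
            total += score(model, [xs[i] for i in b], [ys[i] for i in b])  # test with B
        averages.append(total / c)
    best = max(range(len(bucket)), key=lambda k: averages[k])
    return bucket[best], averages

xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
best, averages = cross_validation_select([MeanModel, LineModel], xs, ys)
```

On this exactly linear toy data the procedure selects `LineModel`, since its held-out error is essentially zero.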

Gating is a generalization of Cross-Validation Selection. 
It involves training another learning model to decide 
which of the models in the bucket is best-suited to solve 
the problem. Often, a perceptron is used for the gating 
model. It can be used to pick the “best” model, or it can 
be used to give a linear weight to the predictions from 
each model in the bucket. 

When a bucket of models is used with a large set of problems,
it may be desirable to avoid training some of the
models that take a long time to train. Landmark learning
is a meta-learning approach that seeks to solve this
problem. It involves training only the fast (but imprecise)
algorithms in the bucket, and then using the performance
of these algorithms to help determine which slow (but
accurate) algorithm is most likely to do best.[20]

53.3.7 Stacking 

Stacking (sometimes called stacked generalization) involves
training a learning algorithm to combine the predictions
of several other learning algorithms. First, all of
the other algorithms are trained using the available data;
then a combiner algorithm is trained to make a final
prediction using all the predictions of the other algorithms as
additional inputs. If an arbitrary combiner algorithm is
used, then stacking can theoretically represent any of the
ensemble techniques described in this article, although in
practice, a single-layer logistic regression model is often
used as the combiner.
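
A minimal sketch of the two-level structure follows. It makes several assumptions for brevity: the task is regression, the two base learners are simple least-squares fits, the combiner is a linear least-squares model rather than the logistic regression mentioned above, and everything is evaluated on the training data. In practice the combiner is usually trained on held-out predictions. All function names are hypothetical.

```python
def fit_line(us, ys):
    """Least-squares fit of y = a + b*u; returns (a, b)."""
    n = len(us)
    mu, my = sum(us) / n, sum(ys) / n
    b = (sum((u - mu) * (y - my) for u, y in zip(us, ys))
         / sum((u - mu) ** 2 for u in us))
    return my - b * mu, b

def solve(a, b):
    """Solve the linear system a w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w

def fit_combiner(features, ys):
    """Least-squares weights for feature rows, via the normal equations."""
    k = len(features[0])
    ata = [[sum(f[r] * f[c] for f in features) for c in range(k)] for r in range(k)]
    aty = [sum(f[r] * y for f, y in zip(features, ys)) for r in range(k)]
    return solve(ata, aty)

def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

xs = [float(i) for i in range(10)]
ys = [x + x * x for x in xs]

# Level 0: two base learners, each too simple to fit the target alone.
a1, b1 = fit_line(xs, ys)                     # linear in x
a2, b2 = fit_line([x * x for x in xs], ys)    # linear in x squared
p1 = [a1 + b1 * x for x in xs]
p2 = [a2 + b2 * x * x for x in xs]

# Level 1: the combiner takes the base predictions as its inputs.
features = [[1.0, u, v] for u, v in zip(p1, p2)]
w = fit_combiner(features, ys)
stacked = [sum(wi * fi for wi, fi in zip(w, f)) for f in features]

base_errors = [mse(p1, ys), mse(p2, ys)]
stacked_error = mse(stacked, ys)
```

Here the stacked predictor has lower error than either base learner, because a weighted sum of the two base predictions can represent the target exactly.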


Stacking typically yields performance better than any single
one of the trained models.[21] It has been successfully
used on both supervised learning tasks (regression,[22]
classification and distance learning[23]) and unsupervised
learning (density estimation).[24] It has also been used to
estimate bagging’s error rate.[3][25] It has been reported to
outperform Bayesian model averaging.[26] The two top
performers in the Netflix competition utilized blending,
which may be considered to be a form of stacking.[27]

• Ensemble learning at Scholarpedia, curated by Robi
Polikar.

• The Waffles (machine learning) toolkit contains
implementations of Bagging, Boosting, Bayesian
Model Averaging, Bayesian Model Combination,
Bucket-of-models, and other ensemble techniques.

• Zhou Zhihua (2012). Ensemble Methods: Foundations
and Algorithms. Chapman and Hall/CRC.
ISBN 978-1-439-83003-1.


Chapter 54 

Random forest 


This article is about the machine learning technique. 
For other kinds of random tree, see Random tree 
(disambiguation) . 

Random forests are an ensemble learning method for
classification, regression and other tasks, that operate by
constructing a multitude of decision trees at training time
and outputting the class that is the mode of the classes
(classification) or mean prediction (regression) of the
individual trees. Random forests correct for decision trees’
habit of overfitting to their training set.

The algorithm for inducing a random forest was developed
by Leo Breiman[1] and Adele Cutler,[2] and “Random
Forests” is their trademark. The method combines
Breiman’s “bagging” idea and the random selection of
features, introduced independently by Ho[3][4] and Amit
and Geman,[5] in order to construct a collection of decision
trees with controlled variance.

The selection of a random subset of features is an example
of the random subspace method, which, in Ho’s formulation,
is a way to implement classification proposed
by Eugene Kleinberg.[6]


54.1 History 

The early development of random forests was influenced
by the work of Amit and Geman,[5] who introduced the
idea of searching over a random subset of the available
decisions when splitting a node, in the context of growing
a single tree. The idea of random subspace selection from
Ho[4] was also influential in the design of random forests.
In this method a forest of trees is grown, and variation
among the trees is introduced by projecting the training
data into a randomly chosen subspace before fitting each
tree. Finally, the idea of randomized node optimization,
where the decision at each node is selected by a randomized
procedure rather than a deterministic optimization,
was first introduced by Dietterich.[7]

The introduction of random forests proper was first made
in a paper by Leo Breiman.[1] This paper describes a
method of building a forest of uncorrelated trees using a
CART-like procedure, combined with randomized node
optimization and bagging. In addition, this paper combines
several ingredients, some previously known and
some novel, which form the basis of the modern practice
of random forests, in particular:

1. Using out-of-bag error as an estimate of the 
generalization error. 

2. Measuring variable importance through permuta¬ 
tion. 

The report also offers the first theoretical result for ran¬ 
dom forests in the form of a bound on the generalization 
error which depends on the strength of the trees in the 
forest and their correlation. 


54.2 Algorithm 

54.2.1 Preliminaries: decision tree learn¬ 
ing 

Main article: Decision tree learning 

Decision trees are a popular method for various machine
learning tasks. Tree learning “come[s] closest to meeting
the requirements for serving as an off-the-shelf procedure
for data mining”, say Hastie et al., because it is invariant
under scaling and various other transformations of feature
values, is robust to inclusion of irrelevant features, and
produces inspectable models. However, trees are seldom
accurate.[8]:352

In particular, trees that are grown very deep tend to learn
highly irregular patterns: they overfit their training sets,
because they have low bias, but very high variance. Random
forests are a way of averaging multiple deep decision
trees, trained on different parts of the same training
set, with the goal of reducing the variance.[8]:587–588 This
comes at the expense of a small increase in the bias and
some loss of interpretability, but generally greatly boosts
the performance of the final model.

54.2.2 Tree bagging

Main article: Bootstrap aggregating 

The training algorithm for random forests applies the general
technique of bootstrap aggregating, or bagging, to
tree learners. Given a training set $X = x_1, \ldots, x_n$ with
responses $Y = y_1, \ldots, y_n$, bagging repeatedly selects a
random sample with replacement of the training set and
fits trees to these samples:

For $b = 1, \ldots, B$:

1. Sample, with replacement, n training examples
from X, Y; call these $X_b$, $Y_b$.

2. Train a decision or regression tree $f_b$ on
$X_b$, $Y_b$.

After training, predictions for unseen samples x' can be
made by averaging the predictions from all the individual
regression trees on x':

$$\hat{f} = \frac{1}{B} \sum_{b=1}^{B} f_b(x')$$

or by taking the majority vote in the case of decision trees.

This bootstrapping procedure leads to better model performance
because it decreases the variance of the model,
without increasing the bias. This means that while the
predictions of a single tree are highly sensitive to noise
in its training set, the average of many trees is not, as
long as the trees are not correlated. Simply training many
trees on a single training set would give strongly correlated
trees (or even the same tree many times, if the training
algorithm is deterministic); bootstrap sampling is a
way of de-correlating the trees by showing them different
training sets.
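
The bagging procedure for regression can be sketched as follows. To keep the example short, one-split regression trees ("stumps") on a single input variable stand in for full decision trees; that simplification, the toy data, and the function names are assumptions made for illustration only.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree on 1-D inputs and return a predictor."""
    pairs = sorted(zip(xs, ys))
    best = None
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot split between identical input values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (pairs[k - 1][0] + pairs[k][0]) / 2, ml, mr)
    if best is None:  # all inputs identical: fall back to the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, threshold, ml, mr = best
    return lambda x: ml if x <= threshold else mr

def bagged_trees(xs, ys, n_trees=50, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample (B = n_trees)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample n examples with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict(trees, x):
    """Average the predictions of all trees on x."""
    return sum(tree(x) for tree in trees) / len(trees)

xs = [float(i) for i in range(20)]
ys = [0.0 if x < 10 else 1.0 for x in xs]
trees = bagged_trees(xs, ys)
low, high = predict(trees, 2.0), predict(trees, 17.0)
```

For classification, the averaging in `predict` would be replaced by a majority vote over the trees.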

The number of samples/trees, B, is a free parameter.
Typically, a few hundred to several thousand trees are
used, depending on the size and nature of the training set.
An optimal number of trees B can be found using cross-validation,
or by observing the out-of-bag error: the mean
prediction error on each training sample $x_i$, using only the
trees that did not have $x_i$ in their bootstrap sample.[9] The
training and test error tend to level off after some number
of trees have been fit.

54.2.3 From bagging to random forests 

Main article: Random subspace method 

The above procedure describes the original bagging algorithm
for trees. Random forests differ in only one way
from this general scheme: they use a modified tree learning
algorithm that selects, at each candidate split in the
learning process, a random subset of the features. This
process is sometimes called “feature bagging”. The reason
for doing this is the correlation of the trees in an ordinary
bootstrap sample: if one or a few features are very
strong predictors for the response variable (target output),
these features will be selected in many of the B trees,
causing them to become correlated.

Typically, for a dataset with p features, $\sqrt{p}$ features are
used in each split.
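
As a small illustration of feature bagging, the sketch below draws the candidate features for a single split. Rounding $\sqrt{p}$ to the nearest integer is an assumption (implementations differ on this), and `candidate_features` is a hypothetical helper name.

```python
import math
import random

def candidate_features(p, rng):
    """Pick about sqrt(p) distinct feature indices to consider at one split."""
    k = max(1, round(math.sqrt(p)))
    return rng.sample(range(p), k)

rng = random.Random(0)
subset = candidate_features(16, rng)   # four of the sixteen feature indices
```

A tree learner would call this once per candidate split and search for the best split among the returned features only.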

54.2.4 Extensions 

Adding one further step of randomization yields extremely
randomized trees, or ExtraTrees. These are trained using
bagging and the random subspace method, as in an ordinary
random forest, but additionally the top-down splitting
in the tree learner is randomized. Instead of computing
the locally optimal feature/split combination (based
on, e.g., information gain or the Gini impurity), for each
feature under consideration a random value is selected in
the feature’s empirical range (in the tree’s training set, i.e.,
the bootstrap sample). The best of these is then chosen
as the split.[10]

54.3 Properties 

54.3.1 Variable importance 

Random forests can be used to rank the importance of
variables in a regression or classification problem in a
natural way. The following technique was described in
Breiman’s original paper[1] and is implemented in the R
package randomForest.[2]

The first step in measuring the variable importance in a
data set $\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^n$ is to fit a random forest to
the data. During the fitting process the out-of-bag error
for each data point is recorded and averaged over the forest
(errors on an independent test set can be substituted
if bagging is not used during training).

To measure the importance of the j-th feature after training,
the values of the j-th feature are permuted among
the training data and the out-of-bag error is again computed
on this perturbed data set. The importance score
for the j-th feature is computed by averaging the difference
in out-of-bag error before and after the permutation
over all trees. The score is normalized by the standard
deviation of these differences.

Features which produce large values for this score are
ranked as more important than features which produce
small values.
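
The permutation idea can be sketched independently of any particular forest implementation. The version below is a simplification under stated assumptions: it evaluates an arbitrary fitted model on a fixed evaluation set instead of using per-tree out-of-bag samples, it averages over repeated shuffles rather than over trees, and it omits the normalization by the standard deviation. The function names and toy model are hypothetical.

```python
import random

def mse(model, rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, j, n_repeats=20, seed=0):
    """Average increase in error after shuffling the values of feature j."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    increases = []
    for _ in range(n_repeats):
        column = [r[j] for r in rows]
        rng.shuffle(column)                    # permute feature j across rows
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        increases.append(mse(model, permuted, targets) - baseline)
    return sum(increases) / n_repeats

# Toy setting: the target depends on feature 0 only; feature 1 is irrelevant.
rows = [[float(i), float((7 * i) % 10)] for i in range(10)]
targets = [3.0 * r[0] for r in rows]
model = lambda row: 3.0 * row[0]               # stands in for a fitted forest

importance_0 = permutation_importance(model, rows, targets, 0)
importance_1 = permutation_importance(model, rows, targets, 1)
```

Shuffling the feature that the model relies on raises the error, while shuffling the irrelevant feature leaves it unchanged, so feature 0 is ranked as more important.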

This method of determining variable importance has
some drawbacks. For data including categorical variables
with different numbers of levels, random forests are biased
in favor of those attributes with more levels. Methods
such as partial permutations[11][12] and growing unbiased
trees[13] can be used to solve the problem. If the data
contain groups of correlated features of similar relevance
for the output, then smaller groups are favored over larger
groups.[14]

54.3.2 Relationship to nearest neighbors 

A relationship between random forests and the k-nearest
neighbor algorithm (k-NN) was pointed out by Lin and
Jeon in 2002.[15] It turns out that both can be viewed
as so-called weighted neighborhood schemes. These are
models built from a training set $\{(x_i, y_i)\}_{i=1}^n$ that make
predictions $\hat{y}$ for new points x' by looking at the
“neighborhood” of the point, formalized by a weight function
W:

$$\hat{y} = \sum_{i=1}^{n} W(x_i, x')\, y_i.$$

Here, $W(x_i, x')$ is the non-negative weight of the i'th
training point relative to the new point x'. For any particular
x', the weights must sum to one. Weight functions
are given as follows:

• In k-NN, the weights are $W(x_i, x') = \frac{1}{k}$ if $x_i$ is one
of the k points closest to x', and zero otherwise.

• In a tree, $W(x_i, x') = \frac{1}{k'}$ if $x_i$ is one of the k'
training points that fall into the same leaf as x', and zero
otherwise.

Since a forest averages the predictions of a set of m trees
with individual weight functions $W_j$, its predictions are

$$\hat{y} = \frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{n} W_j(x_i, x')\, y_i = \sum_{i=1}^{n} \left( \frac{1}{m} \sum_{j=1}^{m} W_j(x_i, x') \right) y_i.$$

This shows that the whole forest is again a weighted
neighborhood scheme, with weights that average those of the
individual trees. The neighbors of x' in this interpretation
are the points $x_i$ which fall in the same leaf as x' in
at least one tree of the forest. In this way, the neighborhood
of x' depends in a complex way on the structure of
the trees, and thus on the structure of the training set. Lin
and Jeon show that the shape of the neighborhood used
by a random forest adapts to the local importance of each
feature.[15]


54.4 Unsupervised learning with random forests

As part of their construction, RF predictors naturally lead
to a dissimilarity measure between the observations. One
can also define an RF dissimilarity measure between
unlabeled data: the idea is to construct an RF predictor that
distinguishes the “observed” data from suitably generated
synthetic data.[11][16] The observed data are the original
unlabeled data and the synthetic data are drawn from a
reference distribution. An RF dissimilarity can be attractive
because it handles mixed variable types well, is
invariant to monotonic transformations of the input variables,
and is robust to outlying observations. The RF
dissimilarity easily deals with a large number of
semi-continuous variables due to its intrinsic variable selection;
for example, the “Addcl 1” RF dissimilarity weighs the
contribution of each variable according to how dependent
it is on other variables. The RF dissimilarity has been
used in a variety of applications, e.g. to find clusters of
patients based on tissue marker data.[17]


54.5 Variants

Instead of decision trees, linear models have been proposed
and evaluated as base estimators in random forests,
in particular multinomial logistic regression and naive
Bayes classifiers.[18][19]


54.6 See also

• Decision tree learning

• Gradient boosting

• Randomized algorithm

• Bootstrap aggregating (bagging)

• Ensemble learning

• Boosting

• Non-parametric statistics

• Kernel random forest


Chapter 55

Boosting (machine learning) 


Boosting is a machine learning ensemble meta-algorithm
for reducing bias primarily and also variance[1] in
supervised learning, and a family of machine learning
algorithms which convert weak learners to strong ones.[2]
Boosting is based on the question posed by Kearns and
Valiant (1988, 1989):[3][4] Can a set of weak learners
create a single strong learner? A weak learner is defined
to be a classifier which is only slightly correlated
with the true classification (it can label examples better
than random guessing). In contrast, a strong learner is a
classifier that is arbitrarily well-correlated with the true
classification.

Robert Schapire’s affirmative answer in a 1990 paper[5]
to the question of Kearns and Valiant has had significant
ramifications in machine learning and statistics, most
notably leading to the development of boosting.[6]

When first introduced, the hypothesis boosting problem
simply referred to the process of turning a weak learner
into a strong learner. “Informally, [the hypothesis boosting]
problem asks whether an efficient learning algorithm
[...] that outputs a hypothesis whose performance is only
slightly better than random guessing [i.e. a weak learner]
implies the existence of an efficient algorithm that outputs
a hypothesis of arbitrary accuracy [i.e. a strong
learner].”[3] Algorithms that achieve hypothesis boosting
quickly became simply known as “boosting”. Freund
and Schapire’s arcing (Adapt[at]ive Resampling and
Combining),[7] as a general technique, is more or less
synonymous with boosting.[8]


55.1 Boosting algorithms 

While boosting is not algorithmically constrained, most
boosting algorithms consist of iteratively learning weak
classifiers with respect to a distribution and adding them
to a final strong classifier. When they are added, they
are typically weighted in some way that is usually related
to the weak learners’ accuracy. After a weak learner is
added, the data is reweighted: examples that are misclassified
gain weight and examples that are classified correctly
lose weight (some boosting algorithms actually decrease
the weight of repeatedly misclassified examples,
e.g., boost by majority and BrownBoost). Thus, future
weak learners focus more on the examples that previous
weak learners misclassified.
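
The reweighting loop can be made concrete with a compact AdaBoost-style sketch. The weak learners here are threshold rules ("stumps") on a single input, the labels are +1 and -1, and the toy data set is chosen so that no single stump classifies it correctly; these choices and all function names are assumptions made for illustration, not a reference implementation.

```python
import math

def stump_predict(stump, x):
    """A stump (threshold, sign) predicts sign to the left of threshold, -sign otherwise."""
    threshold, sign = stump
    return sign if x < threshold else -sign

def best_stump(xs, ys, weights):
    """Return the stump with the lowest weighted error, and that error."""
    values = sorted(set(xs))
    midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
    thresholds = [values[0] - 1] + midpoints + [values[-1] + 1]
    best, best_err = None, float("inf")
    for threshold in thresholds:
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict((threshold, sign), x) != y)
            if err < best_err:
                best, best_err = (threshold, sign), err
    return best, best_err

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1 - 1e-12)        # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # more accurate learners get more say
        ensemble.append((alpha, stump))
        # Misclassified examples gain weight; correctly classified ones lose weight.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]

single = adaboost(xs, ys, rounds=1)
boosted = adaboost(xs, ys, rounds=3)
single_errors = sum(predict(single, x) != y for x, y in zip(xs, ys))
boosted_errors = sum(predict(boosted, x) != y for x, y in zip(xs, ys))
```

On this data the best single stump misclassifies three points, whereas the weighted vote of three stumps classifies every training point correctly.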

There are many boosting algorithms. The original ones,
proposed by Robert Schapire (a recursive majority gate
formulation[5]) and Yoav Freund (boost by majority[9]),
were not adaptive and could not take full advantage of
the weak learners. However, Schapire and Freund then
developed AdaBoost, an adaptive boosting algorithm that
won the prestigious Gödel Prize. Only algorithms that
are provable boosting algorithms in the probably approximately
correct learning formulation are called boosting
algorithms. Other algorithms that are similar in spirit to
boosting algorithms are sometimes called “leveraging
algorithms”, although they are also sometimes incorrectly
called boosting algorithms.[9]

55.2 Examples of boosting algo¬ 
rithms 

The main variation between many boosting algorithms is
their method of weighting training data points and hypotheses.
AdaBoost is very popular and perhaps the
most significant historically, as it was the first algorithm
that could adapt to the weak learners. However,
there are many more recent algorithms such as LPBoost,
TotalBoost, BrownBoost, MadaBoost, LogitBoost, and
others. Many boosting algorithms fit into the AnyBoost
framework,[9] which shows that boosting performs
gradient descent in function space using a convex cost
function.

Boosting algorithms are used in computer vision, where
individual classifiers detecting contrast changes can be
combined to identify facial features.[10]

55.3 Criticism 

In 2008 Phillip Long (at Google) and Rocco A. Servedio
(Columbia University) published a paper[11] at the 25th
International Conference for Machine Learning suggesting
that many of these algorithms are probably flawed.
They conclude that “convex potential boosters cannot
withstand random classification noise,” thus making the
applicability of such algorithms for real world, noisy data
sets questionable. The paper shows that if any non-zero
fraction of the training data is mislabeled, the boosting
algorithm tries extremely hard to correctly classify
these training examples, and fails to produce a model
with accuracy better than 1/2. This result does not apply
to branching program based boosters but does apply
to AdaBoost, LogitBoost, and others.[12][11]

55.4 See also 

55.5 Implementations 

• Orange, a free data mining software suite, module
Orange.ensemble

• Weka is a machine learning set of tools that offers
a variety of implementations of boosting algorithms like
AdaBoost and LogitBoost

• R package GBM (Generalized Boosted Regression
Models) implements extensions to Freund and
Schapire’s AdaBoost algorithm and Friedman’s gradient
boosting machine.

• jboost; AdaBoost, LogitBoost, RobustBoost, BoosTexter
and alternating decision trees


Chapter 56

Bootstrap aggregating 


Bootstrap aggregating, also called bagging, is a
machine learning ensemble meta-algorithm designed to
improve the stability and accuracy of machine learning
algorithms used in statistical classification and regression.
It also reduces variance and helps to avoid overfitting.
Although it is usually applied to decision tree methods, it
can be used with any type of method. Bagging is a special
case of the model averaging approach.


56.1 Description of the technique 

Given a standard training set D of size n, bagging generates
m new training sets $D_i$, each of size n', by sampling
from D uniformly and with replacement. By sampling
with replacement, some observations may be repeated in
each $D_i$. If n' = n, then for large n the set $D_i$ is expected to
have the fraction (1 − 1/e) (≈63.2%) of the unique examples
of D, the rest being duplicates.[1] This kind of sample
is known as a bootstrap sample. The m models are fitted
using the above m bootstrap samples and combined by
averaging the output (for regression) or voting (for
classification).
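
The 63.2% figure can be checked numerically. The probability that a given example appears at least once in a bootstrap sample of size n is $1 - (1 - 1/n)^n$, which tends to $1 - 1/e$ as n grows. The short sketch below compares this value with one simulated bootstrap sample; the sample size and seed are arbitrary choices, and `unique_fraction` is a hypothetical helper name.

```python
import math
import random

def unique_fraction(n, seed=0):
    """Fraction of the n original examples that appear in one bootstrap sample."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]   # n draws with replacement
    return len(set(sample)) / n

n = 10000
exact = 1 - (1 - 1 / n) ** n        # probability that a given example is drawn
limit = 1 - 1 / math.e              # large-n limit, about 0.632
simulated = unique_fraction(n)
```

The remaining examples, roughly 36.8% of the data, are left out of each bootstrap sample.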

Bagging leads to “improvements for unstable procedures”
(Breiman, 1996), which include, for example, artificial
neural networks, classification and regression trees, and
subset selection in linear regression (Breiman, 1994).
Bagging has also been shown to improve preimage
learning.[2][3] On the other
hand, it can mildly degrade the performance of stable
methods such as K-nearest neighbors (Breiman, 1996).


56.2 Example: Ozone data 

To illustrate the basic principles of bagging, below is an
analysis on the relationship between ozone and temperature
(data from Rousseeuw and Leroy (1986), available
at classic data sets, analysis done in R).

The relationship between temperature and ozone in this
data set is apparently non-linear, based on the scatter plot.
To mathematically describe this relationship, LOESS
smoothers (with span 0.5) are used. Instead of building a
single smoother from the complete data set, 100 bootstrap
samples of the data were drawn. Each sample is different
from the original data set, yet resembles it in distribution
and variability. For each bootstrap sample, a LOESS
smoother was fit. Predictions from these 100 smoothers
were then made across the range of the data. The first
10 predicted smooth fits appear as grey lines in the figure
below. The lines are clearly very wiggly and they overfit
the data, a result of the span being too low.

By taking the average of 100 smoothers, each fitted to a
subset of the original data set, we arrive at one bagged
predictor (red line). Clearly, the mean is more stable and
there is less overfit.


56.3 Bagging for nearest neighbour classifiers

The risk of a 1 nearest neighbour (1NN) classifier is at
most twice the risk of the Bayes classifier,[4] but there are
no guarantees that this classifier will be consistent. By
careful choice of the size of the resamples, bagging can
lead to substantial improvements of the performance of
the 1NN classifier. By taking a large number of resamples
of the data of size n', the bagged nearest neighbour
classifier will be consistent provided $n' \to \infty$ diverges
but $n'/n \to 0$ as the sample size $n \to \infty$.

Under infinite simulation, the bagged nearest neighbour
classifier can be viewed as a weighted nearest neighbour
classifier. Suppose that the feature space is d dimensional
and denote by $C^{bnn}_{n,n'}$ the bagged nearest neighbour
classifier based on a training set of size n, with resamples of
size n'. In the infinite sampling case, under certain
regularity conditions on the class distributions, the excess risk
has the following asymptotic expansion[5]

$$\mathcal{R}(C^{bnn}_{n,n'}) - \mathcal{R}(C^{Bayes}) = \left( B_1 \frac{n'}{n} + B_2 \frac{1}{(n')^{4/d}} \right) \{1 + o(1)\},$$

for some constants $B_1$ and $B_2$. The optimal choice of n',
that balances the two terms in the asymptotic expansion,
is given by $n' = B n^{d/(d+4)}$ for some constant B.


56.4 History

Bagging (Bootstrap aggregating) was proposed by Leo
Breiman in 1994 to improve classification by combining
classifications of randomly generated training sets.
See Breiman, 1994. Technical Report No. 421.


56.5 See also

• Boosting (meta-algorithm)

• Bootstrapping (statistics)

• Cross-validation (statistics)

• Random forest

• Random subspace method (attribute bagging)


56.6 References

nition Workshop, pp.1-7, 2011.

[3] Shinde, Amit, Anshuman Sahu, Daniel Apley, and George
Runger. “Preimages for Variation Patterns from Kernel
PCA and Bagging.” IIE Transactions, Vol.46, Iss.5, 2014

[4] Castelli, Vittorio. “Nearest Neighbor Classifiers, p.5”
(PDF). columbia.edu. Columbia University. Retrieved 25
April 2015.

[5] Samworth R. J. (2012). “Optimal weighted nearest neighbour
classifiers”. Annals of Statistics 40 (5): 2733-2763.
doi:10.1214/12-AOS1049.

• Breiman, Leo (1996). “Bagging predictors”.
Machine Learning 24 (2): 123–


Chapter 57

Gradient boosting 


Gradient boosting is a machine learning technique for
regression and classification problems, which produces a
prediction model in the form of an ensemble of weak
prediction models, typically decision trees. It builds the
model in a stage-wise fashion like other boosting methods
do, and it generalizes them by allowing optimization
of an arbitrary differentiable loss function.

The idea of gradient boosting originated in the observation
by Leo Breiman[1] that boosting can be interpreted as
an optimization algorithm on a suitable cost function.
Explicit regression gradient boosting algorithms were
subsequently developed by Jerome H. Friedman[2][3]
simultaneously with the more general functional gradient
boosting perspective of Llew Mason, Jonathan Baxter, Peter
Bartlett and Marcus Frean.[4][5] The latter two papers
introduced the abstract view of boosting algorithms as
iterative functional gradient descent algorithms, that is,
algorithms that optimize a cost functional over function space
by iteratively choosing a function (weak hypothesis) that
points in the negative gradient direction. This functional
gradient view of boosting has led to the development of
boosting algorithms in many areas of machine learning
and statistics beyond regression and classification.


57.1 Informal introduction 

(This section follows the exposition of gradient boosting 
by Li. [6] ) 

Like other boosting methods, gradient boosting combines 
weak learners into a single strong learner, in an itera¬ 
tive fashion. It is easiest to explain in the least-squares 
regression setting, where the goal is to learn a model F 
that predicts values y = F(x) , minimizing the mean 
squared error (y — y ) 2 to the true values y (averaged over 
some training set). 

At each stage $1 \le m \le M$ of gradient boosting, it may
be assumed that there is some imperfect model $F_m$ (at
the outset, a very weak model that just predicts the mean
$\bar{y}$ of the training set could be used). The gradient
boosting algorithm does not change $F_m$ in any way; instead,
it improves on it by constructing a new model that adds
an estimator $h$ to provide a better model
$F_{m+1}(x) = F_m(x) + h(x)$. The question is now how to find
$h$. The gradient boosting solution starts with the observation
that a perfect $h$ would imply

$$F_{m+1}(x) = F_m(x) + h(x) = y$$

or, equivalently,

$$h(x) = y - F_m(x).$$

Therefore, gradient boosting will fit $h$ to the residual
$y - F_m(x)$. As in other boosting variants, each $F_{m+1}$
learns to correct its predecessor $F_m$. A generalization of
this idea to loss functions other than squared error (and
to classification and ranking problems) follows from the
observation that the residuals $y - F(x)$ are the negative
gradients of the squared error loss function
$\frac{1}{2}(y - F(x))^2$ with respect to $F(x)$.
So gradient boosting is a gradient descent algorithm, and
generalizing it entails “plugging in” a different loss and its
gradient.
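The residual-fitting idea can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the weak learner is assumed to be a one-split regression stump on a single input feature, and the helper names (`fit_stump`, `predict_stump`, `boost_least_squares`) and the sine-wave data are hypothetical choices made for this example.

```python
import math

def mean(values):
    return sum(values) / len(values)

def fit_stump(xs, targets):
    """Least-squares regression stump on 1-D inputs: (threshold, left, right)."""
    best_sse, best_stump = float("inf"), None
    for t in sorted(set(xs))[:-1]:            # every split leaves both sides non-empty
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lv, rv = mean(left), mean(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if sse < best_sse:
            best_sse, best_stump = sse, (t, lv, rv)
    return best_stump

def predict_stump(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def boost_least_squares(xs, ys, n_stages):
    """Return the training MSE after each stage, starting from F_0 = mean(y)."""
    F = [mean(ys)] * len(ys)                  # F_0: predict the training mean
    history = [mean([(y - f) ** 2 for y, f in zip(ys, F)])]
    for _ in range(n_stages):
        residuals = [y - f for y, f in zip(ys, F)]   # y - F_m(x)
        h = fit_stump(xs, residuals)                 # weak learner fitted to the residuals
        F = [f + predict_stump(h, x) for f, x in zip(F, xs)]   # F_{m+1} = F_m + h
        history.append(mean([(y - f) ** 2 for y, f in zip(ys, F)]))
    return history

# Illustrative data: one period of a sine wave sampled at 50 points.
xs = [i / 49 for i in range(50)]
ys = [math.sin(2 * math.pi * x) for x in xs]
mse = boost_least_squares(xs, ys, n_stages=20)
```

Because a stump that predicts zero everywhere is always among the candidates, each stage can only lower (or keep) the training error in this sketch; `mse` therefore forms a non-increasing sequence.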

57.2 Algorithm 

In many supervised learning problems one has an output
variable $y$ and a vector of input variables $x$ connected
together via a joint probability distribution $P(x, y)$. Using
a training set $\{(x_1, y_1), \dots, (x_n, y_n)\}$ of known
values of $x$ and corresponding values of $y$, the goal is to
find an approximation $\hat{F}(x)$ to a function $F^*(x)$ that
minimizes the expected value of some specified loss function
$L(y, F(x))$:

$$F^* = \arg\min_{F} \mathbb{E}_{x,y}[L(y, F(x))].$$

The gradient boosting method assumes a real-valued $y$ and
seeks an approximation $\hat{F}(x)$ in the form of a weighted
sum of functions $h_i(x)$ from some class $\mathcal{H}$, called
base (or weak) learners:

$$F(x) = \sum_{i=1}^{M} \gamma_i h_i(x) + \mathrm{const}.$$

In accordance with the empirical risk minimization principle,
the method tries to find an approximation $\hat{F}(x)$ that
minimizes the average value of the loss function on the
training set. It does so by starting with a model consisting
of a constant function $F_0(x)$ and incrementally expanding it
in a greedy fashion:

$$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma),$$

$$F_m(x) = F_{m-1}(x) + \arg\min_{f \in \mathcal{H}} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + f(x_i)),$$

where $f$ is restricted to be a function from the class
$\mathcal{H}$ of base learner functions.

However, the problem of choosing at each step the best
$f$ for an arbitrary loss function $L$ is a hard optimization
problem in general, and so we'll “cheat” by solving a much
easier problem instead.

The idea is to apply a steepest descent step to this
minimization problem. If we only cared about predictions
at the points of the training set, and $f$ were unrestricted,
we'd update the model per the following equations, where
we view $L(y, f)$ not as a functional of $f$, but as a function
of a vector of values $f(x_1), \dots, f(x_n)$:

$$F_m(x) = F_{m-1}(x) - \gamma_m \sum_{i=1}^{n} \nabla_f L(y_i, F_{m-1}(x_i)),$$

$$\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\left(y_i, F_{m-1}(x_i) - \gamma \frac{\partial L(y_i, F_{m-1}(x_i))}{\partial f(x_i)}\right).$$

But as $f$ must come from a restricted class of functions
(that’s what allows us to generalize), we'll just choose
the one that most closely approximates the gradient of $L$.
Having chosen $f$, the multiplier $\gamma$ is then selected using
line search just as shown in the second equation above.

In pseudocode, the generic gradient boosting method
is:[2][7]

Input: training set $\{(x_i, y_i)\}_{i=1}^{n}$, a differentiable
loss function $L(y, F(x))$, number of iterations $M$.

Algorithm:

1. Initialize model with a constant value:

$$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma).$$

2. For $m = 1$ to $M$:

(a) Compute so-called pseudo-residuals:

$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)} \quad \text{for } i = 1, \dots, n.$$

(b) Fit a base learner $h_m(x)$ to pseudo-residuals, i.e.
train it using the training set $\{(x_i, r_{im})\}_{i=1}^{n}$.

(c) Compute multiplier $\gamma_m$ by solving the following
one-dimensional optimization problem:

$$\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)).$$

(d) Update the model:

$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x).$$

3. Output $F_M(x)$.


57.3 Gradient tree boosting

Gradient boosting is typically used with decision trees
(especially CART trees) of a fixed size as base learners.
For this special case Friedman proposes a modification to
the gradient boosting method which improves the quality of
fit of each base learner.

Generic gradient boosting at the $m$-th step would fit a
decision tree $h_m(x)$ to pseudo-residuals. Let $J$ be the
number of its leaves. The tree partitions the input space into
$J$ disjoint regions $R_{1m}, \dots, R_{Jm}$ and predicts a
constant value in each region. Using the indicator notation, the
output of $h_m(x)$ for input $x$ can be written as the sum:

$$h_m(x) = \sum_{j=1}^{J} b_{jm} I(x \in R_{jm}),$$

where $b_{jm}$ is the value predicted in the region $R_{jm}$.[8]

Then the coefficients $b_{jm}$ are multiplied by some value
$\gamma_m$, chosen using line search so as to minimize the loss
function, and the model is updated as follows:

$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x), \quad \gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)).$$

Friedman proposes to modify this algorithm so that it
chooses a separate optimal value $\gamma_{jm}$ for each of the
tree’s regions, instead of a single $\gamma_m$ for the whole
tree. He calls the modified algorithm “TreeBoost”. The
coefficients $b_{jm}$ from the tree-fitting procedure can then
simply be discarded and the model update rule becomes:

$$F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm}), \quad \gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma).$$
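As an illustration of pseudo-residuals combined with a per-region line search, the following sketch runs gradient tree boosting for the absolute-error loss $L(y, F) = |y - F|$, under several simplifying assumptions that are not part of the text above: the trees are stumps ($J = 2$) on a single feature, the model is tracked only at the training points, and the function names and toy data are hypothetical. For this loss the pseudo-residual is the sign of $y_i - F_{m-1}(x_i)$, and the per-region minimizer is a median of the current residuals in that region.

```python
from statistics import median

def fit_stump_threshold(xs, targets):
    """Threshold of the least-squares stump fitted to `targets` (1-D inputs)."""
    best_sse, best_t = float("inf"), None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if sse < best_sse:
            best_sse, best_t = sse, t
    return best_t

def sign(v):
    return (v > 0) - (v < 0)

def treeboost_absolute_loss(xs, ys, n_iter):
    """Boost stumps for L(y, F) = |y - F|; returns fitted values and loss history."""
    F = [median(ys)] * len(ys)          # step 1: a median minimizes sum |y_i - gamma|
    losses = [sum(abs(y - f) for y, f in zip(ys, F))]
    for _ in range(n_iter):             # step 2
        # (a) pseudo-residuals: negative gradient of |y - F| with respect to F
        r = [sign(y - f) for y, f in zip(ys, F)]
        # (b) fit the base learner; only its two regions are kept
        t = fit_stump_threshold(xs, r)
        # (c) separate line search per region: a median of the residuals in it
        gamma_left = median([y - f for x, y, f in zip(xs, ys, F) if x <= t])
        gamma_right = median([y - f for x, y, f in zip(xs, ys, F) if x > t])
        # (d) update the model at the training points
        F = [f + (gamma_left if x <= t else gamma_right) for x, f in zip(xs, F)]
        losses.append(sum(abs(y - f) for y, f in zip(ys, F)))
    return F, losses

# Toy data: a linear trend with one large outlier.
xs = [float(i) for i in range(20)]
ys = [0.5 * x for x in xs]
ys[7] = 100.0
F, losses = treeboost_absolute_loss(xs, ys, n_iter=10)
```

The stump's own leaf values are discarded, as in the modified update rule; only the partition it induces is used. Since choosing zero in a region is always possible, the training loss cannot increase from one iteration to the next in this sketch.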

57.3.1 Size of trees

$J$, the number of terminal nodes in trees, is the method’s
parameter which can be adjusted for a data set at hand.
It controls the maximum allowed level of interaction
between variables in the model. With $J = 2$ (decision
stumps), no interaction between variables is allowed.
With $J = 3$ the model may include effects of the
interaction between up to two variables, and so on.

Hastie et al.[7] comment that typically $4 \le J \le 8$ works
well for boosting and results are fairly insensitive to the
choice of $J$ in this range, $J = 2$ is insufficient for many
applications, and $J > 10$ is unlikely to be required.

57.4 Regularization 

Fitting the training set too closely can lead to degradation
of the model’s generalization ability. Several so-called
regularization techniques reduce this overfitting effect by
constraining the fitting procedure.

One natural regularization parameter is the number of 
gradient boosting iterations M (i.e. the number of trees 
in the model when the base learner is a decision tree).
Increasing M reduces the error on the training set, but setting
it too high may lead to overfitting. An optimal value of M
is often selected by monitoring prediction error on a sep¬ 
arate validation data set. Besides controlling M, several 
other regularization techniques are used. 
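Selecting $M$ from a validation curve can be sketched as follows. The function name and the error values are made up for illustration; in practice the errors would come from evaluating the model after each iteration on held-out data.

```python
def select_num_iterations(validation_errors):
    """Return the M in 1..len(validation_errors) with the lowest validation error.

    validation_errors[m - 1] is the validation error of the model after m
    boosting iterations; ties go to the smaller M.
    """
    best_index = min(range(len(validation_errors)),
                     key=validation_errors.__getitem__)
    return best_index + 1

# Made-up error curve: it improves at first, then rises as the model overfits.
errors = [0.40, 0.31, 0.27, 0.25, 0.26, 0.29, 0.33]
best_m = select_num_iterations(errors)
```

With this curve the selected value is the fourth iteration, after which the validation error starts to climb even though the training error would keep falling.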

57.4.1 Shrinkage 

An important part of the gradient boosting method is
regularization by shrinkage, which consists in modifying the
update rule as follows:

$$F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_m h_m(x), \quad 0 < \nu \le 1,$$

where the parameter $\nu$ is called the “learning rate”.

Empirically it has been found that using small learning
rates (such as $\nu < 0.1$) yields dramatic improvements
in a model’s generalization ability over gradient boosting
without shrinking ($\nu = 1$).[7] However, it comes at the
price of increasing computational time both during training
and querying: a lower learning rate requires more
iterations.
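The trade-off between learning rate and iteration count can be made concrete under a strong idealization that is assumed here only for illustration: squared-error loss with a base learner that reproduces the current residual exactly, so that $\gamma_m h_m(x) = y - F_{m-1}(x)$. The shrunken update then leaves the residual $(1 - \nu)(y - F_{m-1}(x))$, and after $m$ stages a fraction $(1 - \nu)^m$ of the initial residual remains. The function names below are hypothetical.

```python
import math

def residual_fraction(nu, m):
    """Fraction of the initial residual left after m stages with learning
    rate nu, assuming every base learner fits the current residual exactly."""
    return (1.0 - nu) ** m

def stages_needed(nu, target_fraction):
    """Smallest m with (1 - nu)**m <= target_fraction (same idealization)."""
    if not 0.0 < nu <= 1.0:
        raise ValueError("learning rate must satisfy 0 < nu <= 1")
    if nu == 1.0:
        return 1
    return math.ceil(math.log(target_fraction) / math.log(1.0 - nu))
```

Under this idealization, reducing the residual to 1% of its initial size takes 1 stage with $\nu = 1$, 7 stages with $\nu = 0.5$ and 44 stages with $\nu = 0.1$. Real base learners fit the residual only approximately, so these counts indicate the trend rather than predict actual iteration numbers.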

57.4.2 Stochastic gradient boosting 

Soon after the introduction of gradient boosting, Friedman
proposed a minor modification to the algorithm, motivated
by Breiman's bagging method.[3] Specifically, he
proposed that at each iteration of the algorithm, a base
learner should be fit on a subsample of the training set
drawn at random without replacement.[9] Friedman
observed a substantial improvement in gradient boosting’s
accuracy with this modification.

The subsample size is some constant fraction $f$ of the size of
the training set. When $f = 1$, the algorithm is deterministic
and identical to the one described above. Smaller
values of $f$ introduce randomness into the algorithm and
help prevent overfitting, acting as a kind of regularization.
The algorithm also becomes faster, because regression
trees have to be fit to smaller datasets at each iteration.
Friedman[3] found that $0.5 \le f \le 0.8$ leads to good
results for small and moderate sized training sets.
Therefore, $f$ is typically set to 0.5, meaning that one half
of the training set is used to build each base learner.

Also, as in bagging, subsampling allows one to define
an out-of-bag estimate of the prediction performance
improvement by evaluating predictions on those observations
which were not used in the building of the next base
learner. Out-of-bag estimates help avoid the need for an
independent validation dataset, but often underestimate
actual performance improvement and the optimal number
of iterations.[10]
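Drawing the per-iteration subsample and its out-of-bag complement might look like the sketch below. It uses only the Python standard library; the function name `subsample_indices` and the rounding rule (truncation, with at least one point kept) are assumptions made for this example.

```python
import random

def subsample_indices(n, fraction, rng):
    """Draw int(fraction * n) distinct training indices (at least one) without
    replacement, and return them together with the out-of-bag indices."""
    size = max(1, int(fraction * n))
    in_bag = rng.sample(range(n), size)       # sampling without replacement
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return sorted(in_bag), out_of_bag

rng = random.Random(0)                        # seeded for reproducibility
in_bag, out_of_bag = subsample_indices(10, 0.5, rng)
```

At each boosting iteration the base learner would be fitted on the `in_bag` points only, while the `out_of_bag` points can be used to estimate how much that iteration improved the predictions.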


57.4.3 Number of observations in leaves 

Gradient tree boosting implementations often also use
regularization by limiting the minimum number of observations
in trees’ terminal nodes (this parameter is called
n.minobsinnode in the R gbm package[10]). It is used in
the tree building process by ignoring any splits that lead
to nodes containing fewer than this number of training set
instances.

Imposing this limit helps to reduce variance in predictions 
at leaves. 
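The effect of such a limit on the candidate splits of a single feature can be sketched as follows. The function is a hypothetical helper written for this illustration and is not the interface of the gbm package or any other library.

```python
def admissible_thresholds(xs, min_obs):
    """Thresholds t (split: x <= t versus x > t) that leave at least
    min_obs training points on each side."""
    thresholds = []
    for t in sorted(set(xs))[:-1]:
        n_left = sum(1 for x in xs if x <= t)
        if n_left >= min_obs and len(xs) - n_left >= min_obs:
            thresholds.append(t)
    return thresholds
```

For the six points 1, 2, ..., 6 and a minimum of two observations per node, the splits after 1 and after 5 are ignored because they would isolate a single point, which leaves the thresholds 2, 3 and 4.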


57.4.4 Penalize complexity of tree

Another useful regularization technique for gradient
boosted trees is to penalize the complexity of the
learned model.[11] The model complexity can be defined
as proportional to the number of leaves in the learned trees.
The joint optimization of loss and model complexity
corresponds to a post-pruning algorithm that removes branches
that fail to reduce the loss by a threshold. Other kinds of
regularization, such as an $\ell_2$ penalty on the leaf values,
can also be added to avoid overfitting.


57.5 Usage 

Recently, gradient boosting has gained some popularity in
the field of learning to rank. The commercial web search
engines Yahoo[12] and Yandex[13] use variants of gradient
boosting in their machine-learned ranking engines.

57.6 Names

The method goes by a variety of names. Friedman
introduced his regression technique as a “Gradient Boosting
Machine” (GBM).[2] Mason, Baxter et al. described
the generalized abstract class of algorithms as “functional
gradient boosting”.[4][5]

A popular open-source implementation[10] for R calls it
“Generalized Boosting Model”. Commercial implementations
from Salford Systems use the names “Multiple
Additive Regression Trees” (MART) and TreeNet, both
trademarked.

[12] Cossock, David and Zhang, Tong (2008). Statistical
Analysis of Bayes Optimal Subset Ranking, page 14.

[13] Yandex corporate blog entry about new ranking model
“Snezhinsk” (in Russian).


57.7 See also 

• AdaBoost 

• Random forest 

Chapter 58

Semi-supervised learning 




An example of the influence of unlabeled data in semi-supervised 
learning. The top panel shows a decision boundary we might 
adopt after seeing only one positive (white circle) and one nega¬ 
tive (black circle) example. The bottom panel shows a decision 
boundary we might adopt if, in addition to the two labeled exam¬ 
ples, we were given a collection of unlabeled data (gray circles). 
This could be viewed as performing clustering and then labeling 
the clusters with the labeled data, pushing the decision bound¬ 
ary away from high-density regions, or learning an underlying 
one-dimensional manifold where the data reside. 

Semi-supervised learning is a class of supervised learn¬ 
ing tasks and techniques that also make use of unlabeled 
data for training - typically a small amount of labeled data 
with a large amount of unlabeled data. Semi-supervised 
learning falls between unsupervised learning (without any 
labeled training data) and supervised learning (with com¬ 
pletely labeled training data). Many machine-learning 
researchers have found that unlabeled data, when used 
in conjunction with a small amount of labeled data, can 
produce considerable improvement in learning accuracy. 
The acquisition of labeled data for a learning problem often
requires a skilled human agent (e.g. to transcribe an
audio segment) or a physical experiment (e.g. determining
the 3D structure of a protein or determining whether
there is oil at a particular location). The cost associated
with the labeling process thus may render a fully labeled
training set infeasible, whereas acquisition of unlabeled
data is relatively inexpensive. In such situations,
semi-supervised learning can be of great practical value.
Semi-supervised learning is also of theoretical interest in
machine learning and as a model for human learning.

As in the supervised learning framework, we are given
a set of $l$ independently identically distributed examples
$x_1, \dots, x_l \in X$ with corresponding labels
$y_1, \dots, y_l \in Y$. Additionally, we are given $u$
unlabeled examples $x_{l+1}, \dots, x_{l+u} \in X$.
Semi-supervised learning attempts to make use of this combined
information to surpass the classification performance that
could be obtained either by discarding the unlabeled data and
doing supervised learning or by discarding the labels and
doing unsupervised learning.

Semi-supervised learning may refer to either transductive
learning or inductive learning. The goal of transductive
learning is to infer the correct labels for the given
unlabeled data $x_{l+1}, \dots, x_{l+u}$ only. The goal of
inductive learning is to infer the correct mapping from $X$
to $Y$.

Intuitively, we can think of the learning problem as an 
exam and labeled data as the few example problems that 
the teacher solved in class. The teacher also provides a set 
of unsolved problems. In the transductive setting, these 
unsolved problems are a take-home exam and you want 
to do well on them in particular. In the inductive setting, 
these are practice problems of the sort you will encounter 
on the in-class exam. 

It is unnecessary (and, according to Vapnik’s principle, 
imprudent) to perform transductive learning by way of 
inferring a classification rule over the entire input space; 
however, in practice, algorithms formally designed for 
transduction or induction are often used interchangeably. 

58.1 Assumptions used in semi-supervised learning

In order to make any use of unlabeled data, we must
assume some structure to the underlying distribution of
data. Semi-supervised learning algorithms make use of
at least one of the following assumptions.[1]

58.1.1 Smoothness assumption 

Points which are close to each other are more likely to 
share a label. This is also generally assumed in supervised 
learning and yields a preference for geometrically sim¬ 
ple decision boundaries. In the case of semi-supervised 
learning, the smoothness assumption additionally yields 
a preference for decision boundaries in low-density re¬ 
gions, so that there are fewer points close to each other 
but in different classes. 

58.1.2 Cluster assumption 

The data tend to form discrete clusters, and points in the 
same cluster are more likely to share a label (although data 
sharing a label may be spread across multiple clusters). 
This is a special case of the smoothness assumption and 
gives rise to feature learning with clustering algorithms. 

58.1.3 Manifold assumption 

The data lie approximately on a manifold of much lower 
dimension than the input space. In this case we can at¬ 
tempt to learn the manifold using both the labeled and 
unlabeled data to avoid the curse of dimensionality. Then 
learning can proceed using distances and densities de¬ 
fined on the manifold. 

The manifold assumption is practical when high-dimensional
data are being generated by some process
that may be hard to model directly, but which only has
a few degrees of freedom. For instance, human voice is
controlled by a few vocal folds,[2] and images of various
facial expressions are controlled by a few muscles. We
would like in these cases to use distances and smoothness
in the natural space of the generating problem, rather than
in the space of all possible acoustic waves or images
respectively.

58.2 History 

The heuristic approach of self-training (also known as
self-learning or self-labeling) is historically the oldest
approach to semi-supervised learning,[1] with examples of
applications starting in the 1960s (see for instance
Scudder (1965)[3]).

The transductive learning framework was formally
introduced by Vladimir Vapnik in the 1970s.[4] Interest in
inductive learning using generative models also began in the
1970s. A probably approximately correct learning bound
for semi-supervised learning of a Gaussian mixture was
demonstrated by Ratsaby and Venkatesh in 1995.[5]

Semi-supervised learning has recently become more pop¬ 
ular and practically relevant due to the variety of prob¬ 
lems for which vast quantities of unlabeled data are 
available—e.g. text on websites, protein sequences, or 
images. For a review of recent work see a survey article 
by Zhu (2008). [6] 

58.3 Methods for semi-supervised 
learning 

58.3.1 Generative models 

Generative approaches to statistical learning first seek to
estimate $p(x|y)$, the distribution of data points belonging
to each class. The probability $p(y|x)$ that a given point $x$
has label $y$ is then proportional to $p(x|y)p(y)$ by Bayes’
rule. Semi-supervised learning with generative models
can be viewed either as an extension of supervised learning
(classification plus information about $p(x)$) or as an
extension of unsupervised learning (clustering plus some
labels).

Generative models assume that the distributions take
some particular form $p(x|y, \theta)$ parameterized by the
vector $\theta$. If these assumptions are incorrect, the
unlabeled data may actually decrease the accuracy of the
solution relative to what would have been obtained from
labeled data alone.[7] However, if the assumptions are
correct, then the unlabeled data necessarily improves
performance.[5]

The unlabeled data are distributed according to a mixture
of individual-class distributions. In order to learn
the mixture distribution from the unlabeled data, it must
be identifiable, that is, different parameters must yield
different summed distributions. Gaussian mixture
distributions are identifiable and commonly used for generative
models.

The parameterized joint distribution can be written as
$p(x, y|\theta) = p(y|\theta)p(x|y, \theta)$ by using the chain
rule. Each parameter vector $\theta$ is associated with a
decision function $f_\theta(x) = \arg\max_{y} p(y|x, \theta)$.
The parameter is then chosen based on fit to both the labeled
and unlabeled data, weighted by $\lambda$:

$$\arg\max_{\Theta} \left( \log p(\{x_i, y_i\}_{i=1}^{l} | \theta) + \lambda \log p(\{x_i\}_{i=l+1}^{l+u} | \theta) \right)$$ [8]
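The weighted objective can be evaluated directly for a small model. The sketch below assumes, purely for illustration, a two-class model on one real-valued feature with unit-variance Gaussian class distributions and parameters $\theta$ = (prior of class 1, mean of class 0, mean of class 1); the data, the candidate grid and the function names are all hypothetical. Instead of a proper optimizer it simply picks the best of a few candidate parameter vectors.

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def objective(theta, labeled, unlabeled, lam):
    """log p(labeled | theta) + lam * log p(unlabeled | theta)."""
    p1, mu0, mu1 = theta
    prior, mu = (1.0 - p1, p1), (mu0, mu1)
    # Labeled points contribute the joint likelihood p(y) p(x | y).
    ll_labeled = sum(math.log(prior[y] * normal_pdf(x, mu[y])) for x, y in labeled)
    # Unlabeled points contribute the mixture likelihood, summed over classes.
    ll_unlabeled = sum(
        math.log(prior[0] * normal_pdf(x, mu0) + prior[1] * normal_pdf(x, mu1))
        for x in unlabeled
    )
    return ll_labeled + lam * ll_unlabeled

def decide(theta, x):
    """f_theta(x) = argmax_y p(y | x, theta), via Bayes' rule."""
    p1, mu0, mu1 = theta
    return int(p1 * normal_pdf(x, mu1) > (1.0 - p1) * normal_pdf(x, mu0))

labeled = [(-1.0, 0), (1.0, 1)]
unlabeled = [-2.2, -1.9, -2.0, -1.9, 2.1, 1.8, 2.0, 2.1]
candidates = [(0.5, -m, m) for m in (0.5, 1.0, 1.5, 2.0, 2.5)]

best_supervised = max(candidates, key=lambda th: objective(th, labeled, unlabeled, 0.0))
best_semi = max(candidates, key=lambda th: objective(th, labeled, unlabeled, 1.0))
```

With $\lambda = 0$ only the two labeled points matter and the class means are placed at $\pm 1$; with $\lambda = 1$ the unlabeled points clustered near $\pm 2$ pull the selected means outward to $\pm 2$. The example says nothing about accuracy when the Gaussian assumption is wrong.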




58.3.2 Low-density separation

Another major class of methods attempts to place boundaries
in regions where there are few data points (labeled or
unlabeled). One of the most commonly used algorithms
is the transductive support vector machine, or TSVM
(which, despite its name, may be used for inductive learning
as well). Whereas support vector machines for supervised
learning seek a decision boundary with maximal
margin over the labeled data, the goal of TSVM is a labeling
of the unlabeled data such that the decision boundary
has maximal margin over all of the data. In addition to
the standard hinge loss $(1 - y f(x))_+$ for labeled data, a
loss function $(1 - |f(x)|)_+$ is introduced over the
unlabeled data by letting $y = \operatorname{sign} f(x)$. TSVM
then selects $f^*(x) = h^*(x) + b$ from a reproducing kernel
Hilbert space $\mathcal{H}$ by minimizing the regularized
empirical risk:

$$f^* = \arg\min_{f} \left( \sum_{i=1}^{l} (1 - y_i f(x_i))_+ + \lambda_1 \|h\|_{\mathcal{H}}^2 + \lambda_2 \sum_{i=l+1}^{l+u} (1 - |f(x_i)|)_+ \right)$$

An exact solution is intractable due to the non-convex
term $(1 - |f(x)|)_+$, so research has focused on finding
useful approximations.[8]

Other approaches that implement low-density separation
include Gaussian process models, information regularization,
and entropy minimization (of which TSVM is a special
case).

58.3.3 Graph-based methods

Graph-based methods for semi-supervised learning use
a graph representation of the data, with a node for each
labeled and unlabeled example. The graph may be constructed
using domain knowledge or similarity of examples;
two common methods are to connect each data point
to its $k$ nearest neighbors or to examples within some
distance $\epsilon$. The weight $W_{ij}$ of an edge between
$x_i$ and $x_j$ is then set to $e^{-\|x_i - x_j\|^2 / \epsilon}$.

Within the framework of manifold regularization,[9][10]
the graph serves as a proxy for the manifold. A term
is added to the standard Tikhonov regularization problem
to enforce smoothness of the solution relative to the
manifold (in the intrinsic space of the problem) as well
as relative to the ambient input space. The minimization
problem becomes

$$\arg\min_{f \in \mathcal{H}} \left( \frac{1}{l} \sum_{i=1}^{l} V(f(x_i), y_i) + \lambda_A \|f\|_{\mathcal{H}}^2 + \lambda_I \int_{\mathcal{M}} \|\nabla_{\mathcal{M}} f(x)\|^2 \, dp(x) \right)$$ [8]

where $\mathcal{H}$ is a reproducing kernel Hilbert space and
$\mathcal{M}$ is the manifold on which the data lie. The
regularization parameters $\lambda_A$ and $\lambda_I$ control
smoothness in the ambient and intrinsic spaces respectively.
The graph is used to approximate the intrinsic regularization
term. Defining the graph Laplacian $L = D - W$ where
$D_{ii} = \sum_{j=1}^{l+u} W_{ij}$ and $\mathbf{f}$ the vector
$[f(x_1) \dots f(x_{l+u})]$, we have

$$\mathbf{f}^T L \mathbf{f} = \frac{1}{2} \sum_{i,j=1}^{l+u} W_{ij} (f_i - f_j)^2 \approx \int_{\mathcal{M}} \|\nabla_{\mathcal{M}} f(x)\|^2 \, dp(x).$$

The Laplacian can also be used to extend the supervised
learning algorithms regularized least squares and support
vector machines (SVM) to the semi-supervised versions
Laplacian regularized least squares and Laplacian SVM.

58.3.4 Heuristic approaches

Some methods for semi-supervised learning are not
intrinsically geared to learning from both unlabeled and
labeled data, but instead make use of unlabeled data within
a supervised learning framework. For instance, the labeled
and unlabeled examples $x_1, \dots, x_{l+u}$ may inform
a choice of representation, distance metric, or kernel for
the data in an unsupervised first step. Then supervised
learning proceeds from only the labeled examples.

Self-training is a wrapper method for semi-supervised
learning. First a supervised learning algorithm is used
to select a classifier based on the labeled data only. This
classifier is then applied to the unlabeled data to generate
more labeled examples as input for another supervised
learning problem. Generally only the labels the classifier
is most confident of are added at each step.

Co-training is an extension of self-training in which
multiple classifiers are trained on different (ideally disjoint)
sets of features and generate labeled examples for one
another.


58.4 Semi-supervised learning in human cognition

Human responses to formal semi-supervised learning
problems have yielded varying conclusions about the
degree of influence of the unlabeled data (for a summary
see [11]). More natural learning problems may also be
viewed as instances of semi-supervised learning. Much
of human concept learning involves a small amount of
direct instruction (e.g. parental labeling of objects during
childhood) combined with large amounts of unlabeled
experience (e.g. observation of objects without naming
or counting them, or at least without feedback).

Human infants are sensitive to the structure of unlabeled
natural categories such as images of dogs and cats or male
and female faces.[12] More recent work has shown that
infants and children take into account not only the unlabeled
examples available, but the sampling process from which
labeled examples arise.[13][14]


58.5 See also 

• PU learning 

58.7 External links

• A freely available MATLAB implementation of the 
graph-based semi-supervised algorithms Laplacian 
support vector machines and Laplacian regularized 
least squares. 


Chapter 59 

Perceptron 


“Perceptrons” redirects here. For the book of that title, 
see Perceptrons (book). 

In machine learning, the perceptron is an algorithm for
supervised learning of binary classifiers: functions that
can decide whether an input (represented by a vector of
numbers) belongs to one class or another.[1] It is a type of
linear classifier, i.e. a classification algorithm that makes
its predictions based on a linear predictor function
combining a set of weights with the feature vector. The
algorithm allows for online learning, in that it processes
elements in the training set one at a time.

The perceptron algorithm dates back to the late 1950s; its 
first implementation, in custom hardware, was one of the 
first artificial neural networks to be produced. 

59.1 History 

See also: History of artificial intelligence, AI 
winter 

The perceptron algorithm was invented in 1957 at the
Cornell Aeronautical Laboratory by Frank Rosenblatt,[2]
funded by the United States Office of Naval Research.[3]
The perceptron was intended to be a machine, rather than
a program, and while its first implementation was in
software for the IBM 704, it was subsequently implemented
in custom-built hardware as the “Mark 1 perceptron”.
This machine was designed for image recognition: it had
an array of 400 photocells, randomly connected to the
“neurons”. Weights were encoded in potentiometers, and
weight updates during learning were performed by
electric motors.[4]:193

In a 1958 press conference organized by the US Navy,
Rosenblatt made statements about the perceptron that
caused a heated controversy among the fledgling AI
community; based on Rosenblatt’s statements, The New York
Times reported the perceptron to be “the embryo of an
electronic computer that [the Navy] expects will be able
to walk, talk, see, write, reproduce itself and be conscious
of its existence.”[3]

Although the perceptron initially seemed promising, it
was quickly proved that perceptrons could not be trained
to recognise many classes of patterns. This led to the field
of neural network research stagnating for many years, 
before it was recognised that a feedforward neural net¬ 
work with two or more layers (also called a multilayer 
perceptron) had far greater processing power than per¬ 
ceptrons with one layer (also called a single layer percep¬ 
tron). Single layer perceptrons are only capable of learn¬ 
ing linearly separable patterns; in 1969 a famous book en¬ 
titled Perceptrons by Marvin Minsky and Seymour Papert 
showed that it was impossible for these classes of network 
to learn an XOR function. It is often believed that they 
also conjectured (incorrectly) that a similar result would 
hold for a multi-layer perceptron network. However, this 
is not true, as both Minsky and Papert already knew 
that multi-layer perceptrons were capable of producing 
an XOR function. (See the page on Perceptrons (book) 
for more information.) Three years later Stephen Gross- 
berg published a series of papers introducing networks 
capable of modelling differential, contrast-enhancing and 
XOR functions. (The papers were published in 1972
and 1973; see e.g. Grossberg (1973). “Contour enhancement,
short-term memory, and constancies in reverberating
neural networks” (PDF). Studies in Applied Mathematics
52: 213-257.) Nevertheless the often-miscited
Minsky/Papert text caused a significant decline in inter¬ 
est and funding of neural network research. It took ten 
more years until neural network research experienced a 
resurgence in the 1980s. This text was reprinted in 1987 
as “Perceptrons - Expanded Edition” where some errors 
in the original text are shown and corrected. 

The kernel perceptron algorithm was already introduced
in 1964 by Aizerman et al.[5] Margin bound guarantees
were given for the perceptron algorithm in the general
non-separable case first by Freund and Schapire
(1998),[1] and more recently by Mohri and Rostamizadeh
(2013), who extend previous results and give new L1
bounds.[6]


59.2 Definition 

In the modern sense, the perceptron is an algorithm for
learning a binary classifier: a function that maps its input
$x$ (a real-valued vector) to an output value $f(x)$ (a single
binary value):

$$f(x) = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{otherwise} \end{cases}$$

where $w$ is a vector of real-valued weights, $w \cdot x$ is the
dot product $\sum_i w_i x_i$, and $b$ is the bias, a term that
shifts the decision boundary away from the origin and does not
depend on any input value.

The value of $f(x)$ (0 or 1) is used to classify $x$ as
either a positive or a negative instance, in the case of a
binary classification problem. If $b$ is negative, then the
weighted combination of inputs must produce a positive
value greater than $|b|$ in order to push the classifier
neuron over the 0 threshold. Spatially, the bias alters the
position (though not the orientation) of the decision boundary.
The perceptron learning algorithm does not terminate if
the learning set is not linearly separable. If the vectors are
not linearly separable, learning will never reach a point
where all vectors are classified properly. The most famous
example of the perceptron’s inability to solve problems
with linearly nonseparable vectors is the Boolean
exclusive-or problem. The solution spaces of decision
boundaries for all binary functions and learning behaviors
are studied in the reference.[7]
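The decision function and the role of a negative bias can be illustrated with a short sketch. The weights and bias below are hypothetical values chosen so that the classifier computes the logical AND of two binary inputs; they are not taken from the text.

```python
def perceptron_output(w, x, b):
    """f(x) = 1 if w . x + b > 0, else 0."""
    activation = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1 if activation > 0 else 0

# Hypothetical weights implementing logical AND on inputs in {0, 1}.
w, b = (1.0, 1.0), -1.5
outputs = [perceptron_output(w, x, b) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

With $b = -1.5$ the weighted sum must exceed 1.5 before the output switches to 1, which happens only when both inputs are 1.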

In the context of neural networks, a perceptron is an 
artificial neuron using the Heaviside step function as the 
activation function. The perceptron algorithm is also 
termed the single-layer perceptron, to distinguish it 
from a multilayer perceptron, which is a misnomer for a 
more complicated neural network. As a linear classifier, 
the single-layer perceptron is the simplest feedforward 
neural network. 



A diagram showing a perceptron updating its linear boundary as 
more training examples are added. 


59.3 Learning algorithm

Below is an example of a learning algorithm for a
(single-layer) perceptron. For multilayer perceptrons, where a
hidden layer exists, more sophisticated algorithms such
as backpropagation must be used. Alternatively, methods
such as the delta rule can be used if the function is
non-linear and differentiable, although the one below will
work as well.

When multiple perceptrons are combined in an artificial
neural network, each output neuron operates independently
of all the others; thus, learning each output can
be considered in isolation.

59.3.1 Definitions

We first define some variables:

• $y = f(z)$ denotes the output from the perceptron
for an input vector $z$.

• $b$ is the bias term, which in the example below we
take to be 0.

• $D = \{(x_1, d_1), \dots, (x_s, d_s)\}$ is the training set of
$s$ samples, where:

• $x_j$ is the $n$-dimensional input vector.

• $d_j$ is the desired output value of the perceptron
for that input.

We show the values of the features as follows:

• $x_{j,i}$ is the value of the $i$th feature of the $j$th
training input vector.

• $x_{j,0} = 1$.

To represent the weights:

• $w_i$ is the $i$th value in the weight vector, to be
multiplied by the value of the $i$th input feature.

• Because $x_{j,0} = 1$, $w_0$ is effectively a learned
bias that we use instead of the bias constant $b$.

To show the time-dependence of $w$, we use:

• $w_i(t)$ is the weight $i$ at time $t$.

• $\alpha$ is the learning rate, where $0 < \alpha \le 1$.

Too high a learning rate makes the perceptron periodically
oscillate around the solution unless additional steps
are taken.

The appropriate weights are applied to the inputs, and the
resulting weighted sum is passed to a function that produces
the output $y$.


59.3.2 Steps 

1. Initialize the weights and the threshold. Weights may
be initialized to 0 or to a small random value. In the
example below, we use 0.

2. For each example $j$ in our training set $D$, perform
the following steps over the input $x_j$ and desired output
$d_j$:

2a. Calculate the actual output:

$$y_j(t) = f[w(t) \cdot x_j] = f[w_0(t) + w_1(t) x_{j,1} + w_2(t) x_{j,2} + \dots + w_n(t) x_{j,n}]$$

2b. Update the weights:

$$w_i(t+1) = w_i(t) + \alpha (d_j - y_j(t)) x_{j,i}, \quad \text{for all features } 0 \le i \le n.$$

3. For offline learning, step 2 may be repeated until
the iteration error $\frac{1}{s} \sum_{j=1}^{s} |d_j - y_j(t)|$
is less than a user-specified error threshold $\gamma$, or a
predetermined number of iterations have been completed.

The algorithm updates the weights after steps 2a and 2b. 
These weights are immediately applied to a pair in the 
training set, and subsequently updated, rather than wait¬ 
ing until all pairs in the training set have undergone these 
steps. 
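
The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names are made up for this example, the activation is assumed to be a step function with threshold 0, and training stops once a full pass over the training set produces no errors (an error threshold of zero).

```python
def train_perceptron(samples, learning_rate=1.0, max_epochs=100):
    """Online perceptron training with a step activation (threshold 0).

    Each sample is (x, d), where x already starts with the constant
    feature x[0] = 1, so weights[0] plays the role of the bias.
    """
    weights = [0.0] * len(samples[0][0])          # step 1: initialize to 0
    for _ in range(max_epochs):                   # step 3: repeat step 2
        errors = 0
        for x, d in samples:                      # step 2: one pass over D
            activation = sum(w * xi for w, xi in zip(weights, x))
            y = 1 if activation > 0 else 0        # step 2a: actual output
            if y != d:
                errors += 1
                for i, xi in enumerate(x):        # step 2b: update weights
                    weights[i] += learning_rate * (d - y) * xi
        if errors == 0:                           # every sample classified correctly
            break
    return weights


def predict(weights, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


# Logical OR, with the constant input 1 prepended to every vector.
or_samples = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
or_weights = train_perceptron(or_samples)
print(or_weights, [predict(or_weights, x) for x, _ in or_samples])
```

Note that the weights change immediately after each misclassified sample, which is the online behaviour described above.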


59.3.3 Convergence 

The perceptron is a linear classifier; it will therefore never
reach a state in which all the input vectors are classified
correctly if the training set $D$ is not linearly separable, i.e.
if the positive examples cannot be separated from the
negative examples by a hyperplane. In this case, no “approximate”
solution will be gradually approached under
the standard learning algorithm; instead, learning will
fail completely. Hence, if linear separability of the training
set is not known a priori, one of the training variants
below should be used.


But if the training set is linearly separable, then the per¬ 
ceptron is guaranteed to converge, and there is an upper 
bound on the number of times the perceptron will adjust 
its weights during the training. 

Suppose that the input vectors from the two classes can
be separated by a hyperplane with a margin $\gamma$, i.e. there
exists a weight vector $\mathbf{w}$ with $\|\mathbf{w}\| = 1$ and a bias term $b$ such
that $\mathbf{w} \cdot \mathbf{x}_j + b > \gamma$ for all $j$ with $d_j = 1$ and $\mathbf{w} \cdot \mathbf{x}_j + b < -\gamma$
for all $j$ with $d_j = 0$. Also let $R$ denote the maximum
norm of an input vector. Novikoff (1962) proved
that in this case the perceptron algorithm converges after
making $O(R^2/\gamma^2)$ updates. The idea of the proof is
that the weight vector is always adjusted by a bounded
amount in a direction with which it has a negative dot
product, and thus can be bounded above by $O(\sqrt{t})$, where $t$
is the number of changes to the weight vector. But it can
also be bounded below by $O(t)$, because if there exists an
(unknown) satisfactory weight vector, then every change
makes progress in this (unknown) direction by a positive
amount that depends only on the input vector.
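
The bound can be checked numerically on a small example. The sketch below makes several simplifying assumptions that are not part of the text: labels are written as −1/+1 instead of 0/1, there is no bias term, the learning rate is 1, and the unit-norm separator `u` is simply known in advance for this toy data set. It counts the weight updates and compares them with $(R/\gamma)^2$.

```python
import math


def count_perceptron_updates(points, labels, max_epochs=1000):
    """Run the perceptron (labels in {-1, +1}, no bias) and count updates."""
    w = [0.0] * len(points[0])
    updates = 0
    for _ in range(max_epochs):
        changed = False
        for x, y in zip(points, labels):
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:   # mistake
                w = [wi + y * xi for wi, xi in zip(w, x)]
                updates += 1
                changed = True
        if not changed:          # a full pass without mistakes: converged
            break
    return updates


points = [(2, 1), (1, 2), (3, 3), (-1, -2), (-2, -1), (-3, -1)]
labels = [1, 1, 1, -1, -1, -1]

# A known unit-norm separating direction (an assumption of this toy example).
u = (1 / math.sqrt(2), 1 / math.sqrt(2))
gamma = min(y * (u[0] * x[0] + u[1] * x[1]) for x, y in zip(points, labels))
R = max(math.hypot(*x) for x in points)

bound = (R / gamma) ** 2
updates = count_perceptron_updates(points, labels)
print(updates, "updates; bound (R / gamma)^2 =", bound)
```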



Two classes of points, and two of the infinitely many linear 
boundaries that separate them. Even though the boundaries are 
at nearly right angles to one another, the perceptron algorithm 
has no way of choosing between them. 


While the perceptron algorithm is guaranteed to converge
on some solution in the case of a linearly separable training
set, it may still pick any solution, and problems may
admit many solutions of varying quality. [8] The perceptron
of optimal stability, nowadays better known as the
linear support vector machine, was designed to solve this
problem.

The decision boundary of a perceptron is invariant with
respect to scaling of the weight vector; that is, a perceptron
trained with initial weight vector $\mathbf{w}$ and learning rate
$\alpha$ behaves identically to a perceptron trained with initial
weight vector $\mathbf{w}/\alpha$ and learning rate 1. Thus, since the
initial weights become irrelevant with increasing number
of iterations, the learning rate does not matter in the case
of the perceptron and is usually just set to 1.
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59.4 Variants 

The pocket algorithm with ratchet (Gallant, 1990) solves 
the stability problem of perceptron learning by keep¬ 
ing the best solution seen so far “in its pocket”. The 
pocket algorithm then returns the solution in the pocket, 
rather than the last solution. It can be used also for non- 
separable data sets, where the aim is to find a percep¬ 
tron with a small number of misclassifications. However, 
these solutions appear purely stochastically and hence the 
pocket algorithm neither approaches them gradually in 
the course of learning, nor are they guaranteed to show 
up within a given number of learning steps. 
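
A simplified sketch of the pocket idea follows. It is not Gallant's exact procedure: the samples are visited in a fixed cyclic order rather than at random, and the "ratchet" is implemented by recounting the training errors after every update. The function names and the one-dimensional toy data are assumptions made for illustration.

```python
def count_errors(weights, samples):
    """Number of samples misclassified by a threshold-at-zero unit."""
    return sum(
        (1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0) != d
        for x, d in samples
    )


def pocket_perceptron(samples, epochs=20):
    weights = [0.0] * len(samples[0][0])
    pocket, pocket_errors = list(weights), count_errors(weights, samples)
    for _ in range(epochs):
        for x, d in samples:
            y = 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0
            if y != d:
                # Ordinary perceptron update with learning rate 1.
                weights = [w + (d - y) * xi for w, xi in zip(weights, x)]
                errors = count_errors(weights, samples)
                if errors < pocket_errors:   # ratchet: keep strict improvements only
                    pocket, pocket_errors = list(weights), errors
    return pocket, pocket_errors


# 1-D inputs with a constant 1 prepended; the last sample is a noisy label,
# so no threshold classifies all five points correctly.
noisy = [((1, -2), 0), ((1, -1), 0), ((1, 1), 1), ((1, 2), 1), ((1, 3), 0)]
best_weights, best_errors = pocket_perceptron(noisy)
print(best_weights, best_errors)
```

The running weight vector keeps changing on this data, while the pocket retains the best weights seen so far.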

The Maxover algorithm (Wendemuth, 1995) [9] is
“robust” in the sense that it will converge regardless of
(prior) knowledge of linear separability of the data set. In
the linearly separable case, it will solve the training problem,
if desired even with optimal stability (maximum margin
between the classes). For non-separable data sets, it
will return a solution with a small number of misclassifi¬ 
cations. In all cases, the algorithm gradually approaches 
the solution in the course of learning, without memoriz¬ 
ing previous states and without stochastic jumps. Con¬ 
vergence is to global optimality for separable data sets 
and to local optimality for non-separable data sets. 

In separable problems, perceptron training can also aim at
finding the largest separating margin between the classes.
The so-called perceptron of optimal stability can be determined
by means of iterative training and optimization
schemes, such as the Min-Over algorithm (Krauth and
Mezard, 1987) [10] or the AdaTron (Anlauf and Biehl,
1989). [11] AdaTron uses the fact that the corresponding
quadratic optimization problem is convex. The perceptron
of optimal stability, together with the kernel trick,
are the conceptual foundations of the support vector machine.

The α-perceptron further used a pre-processing layer of
fixed random weights, with thresholded output units. This 
enabled the perceptron to classify analogue patterns, by 
projecting them into a binary space. In fact, for a pro¬ 
jection space of sufficiently high dimension, patterns can 
become linearly separable. 

For example, consider the case of having to classify data 
into two classes. Here is a small such data set, consisting 
of points coming from two Gaussian distributions. 

• Two-class Gaussian data 

• A linear classifier operating on the original space 

• A linear classifier operating on a high-dimensional 
projection 

A linear classifier can only separate points with a
hyperplane, so no linear classifier can classify all the
points here perfectly. On the other hand, the data can
be projected into a large number of dimensions. In our
example, a random matrix was used to project the data
linearly to a 1000-dimensional space; then each resulting
data point was transformed through the hyperbolic tangent
function. A linear classifier can then separate the
data, as shown in the third figure. However, the data may
still not be completely separable in this space, in which
case the perceptron algorithm would not converge. In the
example shown, stochastic steepest gradient descent was
used to adapt the parameters.

Another way to solve nonlinear problems without using
multiple layers is to use higher order networks (sigma-pi
unit). In this type of network, each element in the input
vector is extended with each pairwise combination of
multiplied inputs (second order). This can be extended
to an $n$-order network.
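
As a small illustration of the second-order idea, the sketch below appends the pairwise product of the inputs to each XOR input vector and trains an ordinary perceptron on the result. The helper names are hypothetical, and the training loop is a plain perceptron with learning rate 1 and threshold 0, chosen only to keep the example short.

```python
from itertools import combinations


def second_order_features(x):
    """Append every pairwise product x_a * x_b (a < b) to the input vector."""
    return tuple(x) + tuple(a * b for a, b in combinations(x, 2))


def train(samples, max_epochs=1000):
    """Plain perceptron; returns the weights, or None if it never settles."""
    w = [0.0] * len(samples[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, d in samples:
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            if y != d:
                w = [wi + (d - y) * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            return w
    return None


xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# The leading 1 is the constant bias input.
plain = [((1,) + x, d) for x, d in xor]
expanded = [((1,) + second_order_features(x), d) for x, d in xor]

print(train(plain))      # XOR is not linearly separable in the original space
print(train(expanded))   # separable once the product x1 * x2 is available
```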

It should be kept in mind, however, that the best classifier 
is not necessarily that which classifies all the training data 
perfectly. Indeed, if we had the prior constraint that the 
data come from equi-variant Gaussian distributions, the 
linear separation in the input space is optimal, and the 
nonlinear solution is overfitted. 

Other linear classification algorithms include Winnow, 
support vector machine and logistic regression. 


59.5 Example 

A perceptron learns to perform a binary NAND function
on inputs $x_1$ and $x_2$.

Inputs: $x_0$, $x_1$, $x_2$, with input $x_0$ held constant at 1.
Threshold ($t$): 0.5
Bias ($b$): 1
Learning rate ($r$): 0.1

Training set, consisting of four samples:

$\{((1, 0, 0), 1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)\}$

In the following, the final weights of one iteration become 
the initial weights of the next. Each cycle over all the 
samples in the training set is demarcated with heavy lines. 

This example can be implemented in the following 
Python code. 

    threshold = 0.5
    learning_rate = 0.1
    weights = [0, 0, 0]
    training_set = [((1, 0, 0), 1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)]


    def dot_product(values, weights):
        return sum(value * weight for value, weight in zip(values, weights))


    while True:
        print('-' * 60)
        error_count = 0
        for input_vector, desired_output in training_set:
            print(weights)
            # The comparison yields a bool, which acts as 0 or 1 in arithmetic.
            result = dot_product(input_vector, weights) > threshold
            error = desired_output - result
            if error != 0:
                error_count += 1
                for index, value in enumerate(input_vector):
                    weights[index] += learning_rate * error * value
        # Stop after a full pass over the training set without errors.
        if error_count == 0:
            break




59.6 Multiclass perceptron 

Like most other techniques for training linear classifiers,
the perceptron generalizes naturally to multiclass classification.
Here, the input $x$ and the output $y$ are drawn
from arbitrary sets. A feature representation function
$f(x, y)$ maps each possible input/output pair to a finite-dimensional
real-valued feature vector. As before, the
feature vector is multiplied by a weight vector $\mathbf{w}$, but
now the resulting score is used to choose among many
possible outputs:

$$\hat{y} = \operatorname{argmax}_y f(x, y) \cdot \mathbf{w}.$$

Learning again iterates over the examples, predicting an
output for each, leaving the weights unchanged when the
predicted output matches the target, and changing them
when it does not. The update becomes:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + f(x, y) - f(x, \hat{y}).$$

This multiclass formulation reduces to the original perceptron
when $x$ is a real-valued vector, $y$ is chosen from
$\{0, 1\}$, and $f(x, y) = y x$.

For certain problems, input/output representations and
features can be chosen so that $\operatorname{argmax}_y f(x, y) \cdot \mathbf{w}$ can
be found efficiently even though $y$ is chosen from a very
large or even infinite set.

In recent years, perceptron training has become popular 
in the field of natural language processing for such tasks 
as part-of-speech tagging and syntactic parsing (Collins,
2002).
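
A minimal sketch of the multiclass update, under one particular (assumed) choice of feature function: $f(x, y)$ copies the input vector into a block of the feature vector reserved for class $y$ and leaves the other blocks at zero. The function names and the three-point data set are illustrative only.

```python
def feature_map(x, y, num_classes):
    """f(x, y): copy x into the block of the vector reserved for class y."""
    features = [0.0] * (len(x) * num_classes)
    features[y * len(x):(y + 1) * len(x)] = x
    return features


def score(w, x, y, num_classes):
    return sum(wi * fi for wi, fi in zip(w, feature_map(x, y, num_classes)))


def predict(w, x, num_classes):
    # argmax over the candidate outputs; ties go to the lowest class index.
    return max(range(num_classes), key=lambda y: score(w, x, y, num_classes))


def train_multiclass(samples, num_classes, epochs=10):
    w = [0.0] * (len(samples[0][0]) * num_classes)
    for _ in range(epochs):
        for x, y in samples:
            y_hat = predict(w, x, num_classes)
            if y_hat != y:
                # w <- w + f(x, y) - f(x, y_hat)
                good = feature_map(x, y, num_classes)
                bad = feature_map(x, y_hat, num_classes)
                w = [wi + g - b for wi, g, b in zip(w, good, bad)]
    return w


samples = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([-1.0, -1.0], 2)]
w = train_multiclass(samples, num_classes=3)
print([predict(w, x, 3) for x, _ in samples])
```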
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59.8 External links 

• A Perceptron implemented in MATLAB to learn bi¬ 
nary NAND function 

• Chapter 3 Weighted networks - the perceptron and 
chapter 4 Perceptron learning of Neural Networks - 
A Systematic Introduction by Raul Rojas (ISBN 978- 
3-540-60505-8) 

• Explanation of the update rule by Charles Elkan 

• History of perceptrons 

• Mathematics of perceptrons 


Chapter 60 

Support vector machine 


Not to be confused with Secure Virtual Machine. 

In machine learning, support vector machines (SVMs,
also support vector networks [1]) are supervised learning
models with associated learning algorithms that analyze
data and recognize patterns, used for classification
and regression analysis. Given a set of training examples,
each marked for belonging to one of two categories, an 
SVM training algorithm builds a model that assigns new 
examples into one category or the other, making it a non- 
probabilistic binary linear classifier. An SVM model is a 
representation of the examples as points in space, mapped 
so that the examples of the separate categories are divided 
by a clear gap that is as wide as possible. New examples 
are then mapped into that same space and predicted to 
belong to a category based on which side of the gap they 
fall on. 

In addition to performing linear classification, SVMs can 
efficiently perform a non-linear classification using what 
is called the kernel trick, implicitly mapping their inputs 
into high-dimensional feature spaces. 


60.1 Definition 

More formally, a support vector machine constructs a 
hyperplane or set of hyperplanes in a high- or infinite¬ 
dimensional space, which can be used for classification, 
regression, or other tasks. Intuitively, a good separation 
is achieved by the hyperplane that has the largest distance 
to the nearest training-data point of any class (so-called 
functional margin), since in general the larger the margin 
the lower the generalization error of the classifier. 

Whereas the original problem may be stated in a finite-dimensional
space, it often happens that the sets to discriminate
are not linearly separable in that space. For this reason,
it was proposed that the original finite-dimensional
space be mapped into a much higher-dimensional space,
presumably making the separation easier in that space.
To keep the computational load reasonable, the mappings
used by SVM schemes are designed to ensure that dot
products may be computed easily in terms of the variables
in the original space, by defining them in terms of a kernel
function $k(x, y)$ selected to suit the problem. [2] The hyperplanes
in the higher-dimensional space are defined as
the set of points whose dot product with a vector in that
space is constant. The vectors defining the hyperplanes
can be chosen to be linear combinations with parameters
$\alpha_i$ of images of feature vectors $x_i$ that occur in the
data base. With this choice of a hyperplane, the points $x$
in the feature space that are mapped into the hyperplane
are defined by the relation: $\sum_i \alpha_i k(x_i, x) = \text{constant}$.
Note that if $k(x, y)$ becomes small as $y$ grows further
away from $x$, each term in the sum measures the degree
of closeness of the test point $x$ to the corresponding data
base point $x_i$. In this way, the sum of kernels above can
be used to measure the relative nearness of each test point
to the data points originating in one or the other of the sets
to be discriminated. Note the fact that the set of points
$x$ mapped into any hyperplane can be quite convoluted
as a result, allowing much more complex discrimination
between sets which are not convex at all in the original
space.

60.2 History 

The original SVM algorithm was invented by Vladimir N.
Vapnik and Alexey Ya. Chervonenkis in 1963. In 1992,
Bernhard E. Boser, Isabelle M. Guyon and Vladimir
N. Vapnik suggested a way to create nonlinear classifiers
by applying the kernel trick to maximum-margin
hyperplanes. [3] The current standard incarnation (soft
margin) was proposed by Corinna Cortes and Vapnik in
1993 and published in 1995. [1]

60.3 Motivation 

Classifying data is a common task in machine learning. 
Suppose some given data points each belong to one of two 
classes, and the goal is to decide which class a new data 
point will be in. In the case of support vector machines, a 
data point is viewed as a $p$-dimensional vector (a list of $p$
numbers), and we want to know whether we can separate
such points with a $(p-1)$-dimensional hyperplane. This
is called a linear classifier. There are many hyperplanes 
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H i does not separate the classes. Hz does, but only with a small 
margin. H 3 separates them with the maximum margin. 

that might classify the data. One reasonable choice as the 
best hyperplane is the one that represents the largest sepa¬ 
ration, or margin, between the two classes. So we choose 
the hyperplane so that the distance from it to the nearest 
data point on each side is maximized. If such a hyper¬ 
plane exists, it is known as the maximum-margin hyper¬ 
plane and the linear classifier it defines is known as a max¬ 
imum margin classifier, or equivalently, the perceptron of 
optimal stability. 

60.4 Linear SVM 

Given some training data $\mathcal{D}$, a set of $n$ points of the form

$$\mathcal{D} = \{(\mathbf{x}_i, y_i) \mid \mathbf{x}_i \in \mathbb{R}^p,\ y_i \in \{-1, 1\}\}_{i=1}^{n}$$

where each $y_i$ is either 1 or -1, indicating the class to which
the point $\mathbf{x}_i$ belongs. Each $\mathbf{x}_i$ is a $p$-dimensional real
vector. We want to find the maximum-margin hyperplane
that divides the points having $y_i = 1$ from those having
$y_i = -1$. Any hyperplane can be written as the set of
points $\mathbf{x}$ satisfying

$$\mathbf{w} \cdot \mathbf{x} - b = 0,$$

where $\cdot$ denotes the dot product and $\mathbf{w}$ the (not necessarily
normalized) normal vector to the hyperplane. The
parameter $\frac{b}{\|\mathbf{w}\|}$ determines the offset of the hyperplane
from the origin along the normal vector $\mathbf{w}$.

If the training data are linearly separable, we can select
two hyperplanes in a way that they separate the data and
there are no points between them, and then try to maximize
their distance. The region bounded by them is called
“the margin”. These hyperplanes can be described by the
equations

$$\mathbf{w} \cdot \mathbf{x} - b = 1$$

and

$$\mathbf{w} \cdot \mathbf{x} - b = -1.$$

[Figure: Maximum-margin hyperplane and margins for an SVM trained
with samples from two classes. Samples on the margin are called
the support vectors.]

By using geometry, we find the distance between these
two hyperplanes is $\frac{2}{\|\mathbf{w}\|}$, so we want to minimize $\|\mathbf{w}\|$.
As we also have to prevent data points from falling into
the margin, we add the following constraint: for each $i$
either

$$\mathbf{w} \cdot \mathbf{x}_i - b \ge 1 \quad \text{for } \mathbf{x}_i \text{ of the first class } (y_i = 1)$$

or

$$\mathbf{w} \cdot \mathbf{x}_i - b \le -1 \quad \text{for } \mathbf{x}_i \text{ of the second class } (y_i = -1).$$

This can be rewritten as:

$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1, \quad \text{for all } 1 \le i \le n. \qquad (1)$$

We can put this together to get the optimization problem:

Minimize (in $\mathbf{w}, b$)

$$\|\mathbf{w}\|$$

subject to (for any $i = 1, \dots, n$)

$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1.$$








60.4.1 Primal form

The optimization problem presented in the preceding section
is difficult to solve because it depends on $\|\mathbf{w}\|$, the
norm of $\mathbf{w}$, which involves a square root. Fortunately it
is possible to alter the equation by substituting $\|\mathbf{w}\|$ with
$\frac{1}{2}\|\mathbf{w}\|^2$ (the factor of $\frac{1}{2}$ being used for mathematical
convenience) without changing the solution (the minimum of
the original and the modified equation have the same $\mathbf{w}$
and $b$). This is a quadratic programming optimization
problem. More clearly:

$$\arg\min_{(\mathbf{w}, b)} \frac{1}{2}\|\mathbf{w}\|^2$$

subject to (for any $i = 1, \dots, n$)

$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1.$$

By introducing Lagrange multipliers $\boldsymbol{\alpha}$, the previous
constrained problem can be expressed as

$$\arg\min_{\mathbf{w}, b} \max_{\boldsymbol{\alpha} \ge 0} \left\{ \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i - b) - 1 \right] \right\}$$

that is, we look for a saddle point. In doing so all the points
which can be separated as $y_i(\mathbf{w} \cdot \mathbf{x}_i - b) - 1 > 0$ do not
matter since we must set the corresponding $\alpha_i$ to zero.

This problem can now be solved by standard quadratic
programming techniques and programs. The “stationary”
Karush-Kuhn-Tucker condition implies that the solution
can be expressed as a linear combination of the training
vectors

$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i.$$

Only a few $\alpha_i$ will be greater than zero. The corresponding
$\mathbf{x}_i$ are exactly the support vectors, which lie on the
margin and satisfy $y_i(\mathbf{w} \cdot \mathbf{x}_i - b) = 1$. From this one can
derive that the support vectors also satisfy

$$\mathbf{w} \cdot \mathbf{x}_i - b = \frac{1}{y_i} = y_i \iff b = \mathbf{w} \cdot \mathbf{x}_i - y_i$$

which allows one to define the offset $b$. The $b$ depends
on $y_i$ and $\mathbf{x}_i$, so it will vary for each data point in the
sample. In practice, it is more robust to average over all
$N_{SV}$ support vectors, since the average over the sample
is an unbiased estimator of the population mean:

$$b = \frac{1}{N_{SV}} \sum_{i=1}^{N_{SV}} (\mathbf{w} \cdot \mathbf{x}_i - y_i)$$


60.4.2 Dual form

Writing the classification rule in its unconstrained dual
form reveals that the maximum-margin hyperplane and
therefore the classification task is only a function of the
support vectors, the subset of the training data that lie on
the margin.

Using the fact that $\|\mathbf{w}\|^2 = \mathbf{w}^T \cdot \mathbf{w}$ and substituting
$\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i$, one can show that the dual of the SVM
reduces to the following optimization problem:

Maximize (in $\alpha_i$)

$$\tilde{L}(\boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)$$

subject to (for any $i = 1, \dots, n$)

$$\alpha_i \ge 0,$$

and to the constraint from the minimization in $b$

$$\sum_{i=1}^{n} \alpha_i y_i = 0.$$

Here the kernel is defined by $k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j$.

$\mathbf{w}$ can be computed thanks to the $\alpha$ terms:

$$\mathbf{w} = \sum_{i} \alpha_i y_i \mathbf{x}_i.$$


60.4.3 Biased and unbiased hyperplanes

For simplicity reasons, sometimes it is required that the
hyperplane pass through the origin of the coordinate system.
Such hyperplanes are called unbiased, whereas general
hyperplanes not necessarily passing through the origin
are called biased. An unbiased hyperplane can be enforced
by setting $b = 0$ in the primal optimization problem.
The corresponding dual is identical to the dual given
above without the equality constraint

$$\sum_{i=1}^{n} \alpha_i y_i = 0.$$


60.5 Soft margin


In 1995, Corinna Cortes and Vladimir N. Vapnik sug¬ 
gested a modified maximum margin idea that allows for 
mislabeled examples. [1] If there exists no hyperplane that 
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can split the “yes” and “no” examples, the Soft Margin 
method will choose a hyperplane that splits the examples 
as cleanly as possible, while still maximizing the distance 
to the nearest cleanly split examples. The method introduces
non-negative slack variables, $\xi_i$, which measure
the degree of misclassification of the data $\mathbf{x}_i$:

$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1 - \xi_i, \quad 1 \le i \le n. \qquad (2)$$

The objective function is then increased by a function
which penalizes non-zero $\xi_i$, and the optimization becomes
a trade-off between a large margin and a small error
penalty. If the penalty function is linear, the optimization
problem becomes:

$$\arg\min_{\mathbf{w}, \boldsymbol{\xi}, b} \left\{ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \right\}$$

subject to (for any $i = 1, \dots, n$)

$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1 - \xi_i, \quad \xi_i \ge 0.$$

Using the hinge function notation like that in MARS, this
optimization problem can be rewritten as
$\sum_i [1 - y_i(\mathbf{w} \cdot \mathbf{x}_i - b)]_+ + \lambda \|\mathbf{w}\|^2$, where
$[1 - y_i(\mathbf{w} \cdot \mathbf{x}_i - b)]_+ = \xi_i$ and $\lambda = 1/(2C)$.

This constraint in (2) along with the objective of minimizing
$\|\mathbf{w}\|$ can be solved using Lagrange multipliers as
done above. One then has to solve the following problem:

$$\arg\min_{\mathbf{w}, \boldsymbol{\xi}, b} \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}} \left\{ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i - b) - 1 + \xi_i \right] - \sum_{i=1}^{n} \beta_i \xi_i \right\}$$

with $\alpha_i, \beta_i \ge 0$.

[Figure: An example for a result of soft-margin SVM]


60.5.1 Dual form

Maximize (in $\alpha_i$)

$$\tilde{L}(\boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)$$

subject to (for any $i = 1, \dots, n$)

$$0 \le \alpha_i \le C,$$

and

$$\sum_{i=1}^{n} \alpha_i y_i = 0.$$

The key advantage of a linear penalty function is that
the slack variables vanish from the dual problem, with
the constant $C$ appearing only as an additional constraint
on the Lagrange multipliers. For the above formulation
and its huge impact in practice, Cortes and Vapnik
received the 2008 ACM Paris Kanellakis Award. [4]
Nonlinear penalty functions have been used, particularly
to reduce the effect of outliers on the classifier, but unless
care is taken the problem becomes non-convex, and thus
it is considerably more difficult to find a global solution.


60.6 Nonlinear classification

[Figure: Kernel machine]

The original optimal hyperplane algorithm proposed by 
Vapnik in 1963 was a linear classifier. However, in 1992, 
Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. 
Vapnik suggested a way to create nonlinear classifiers by 
applying the kernel trick (originally proposed by Aizer¬ 
man et al. [5| j to maximum-margin hyperplanes. 161 The re¬ 
sulting algorithm is formally similar, except that every dot 
product is replaced by a nonlinear kernel function. This 
allows the algorithm to fit the maximum-margin hyper¬ 
plane in a transformed feature space. The transformation 
may be nonlinear and the transformed space high dimen¬ 
sional; thus though the classifier is a hyperplane in the 
high-dimensional feature space, it may be nonlinear in the 
original input space. 

If the kernel used is a Gaussian radial basis function, the 
corresponding feature space is a Hilbert space of infi¬ 
nite dimensions. Maximum margin classifiers are well 



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regularized, and previously it was widely believed that the 
infinite dimensions do not spoil the results. However, it 
has been shown that higher dimensions do increase the 
generalization error, although the amount is bounded. [7]

Some common kernels include: 

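• Polynomial (homogeneous): $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j)^d$

• Polynomial (inhomogeneous): $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^d$

• Gaussian radial basis function: $k(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$,
for $\gamma > 0$. Sometimes parametrized using $\gamma = 1/(2\sigma^2)$

• Hyperbolic tangent: $k(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\kappa\, \mathbf{x}_i \cdot \mathbf{x}_j + c)$,
for some (not every) $\kappa > 0$ and $c < 0$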
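
The kernels listed above translate directly into code. The sketch below uses plain Python tuples as vectors; the function names are chosen for this example, and `phi` is the explicit degree-2 feature map for two-dimensional inputs, included only to show that the homogeneous polynomial kernel equals a dot product in a transformed space.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(x, y, degree, homogeneous=True):
    base = dot(x, y) if homogeneous else dot(x, y) + 1
    return base ** degree


def rbf_kernel(x, y, gamma):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def tanh_kernel(x, y, kappa, c):
    return math.tanh(kappa * dot(x, y) + c)


def phi(x):
    """Explicit feature map of the homogeneous degree-2 kernel for 2-D inputs."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


a, b = (1.0, 2.0), (3.0, 4.0)
print(polynomial_kernel(a, b, 2))          # (a . b)^2
print(dot(phi(a), phi(b)))                 # same value via the feature map
print(polynomial_kernel(a, b, 2, homogeneous=False))
print(rbf_kernel(a, b, gamma=0.5))
```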

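The kernel is related to the transform $\varphi(\mathbf{x}_i)$ by the equation
$k(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j)$. The value $\mathbf{w}$ is also in
the transformed space, with $\mathbf{w} = \sum_i \alpha_i y_i \varphi(\mathbf{x}_i)$. Dot
products with $\mathbf{w}$ for classification can again be computed
by the kernel trick, i.e. $\mathbf{w} \cdot \varphi(\mathbf{x}) = \sum_i \alpha_i y_i k(\mathbf{x}_i, \mathbf{x})$.
However, there does not in general exist a value $\mathbf{w}'$ such
that $\mathbf{w} \cdot \varphi(\mathbf{x}) = k(\mathbf{w}', \mathbf{x})$.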
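
In code, classification with the kernel trick never forms $\mathbf{w}$ explicitly; it only sums kernel evaluations against the support vectors. The sketch below assumes the offset convention $\mathbf{w} \cdot \varphi(\mathbf{x}) - b$ used for the linear SVM, a linear kernel, and hand-picked multipliers for a two-point toy problem; all names and values are illustrative.

```python
def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))


def decision_value(x, support_vectors, labels, alphas, b, kernel):
    """w . phi(x) - b, evaluated as sum_i alpha_i y_i k(x_i, x) - b."""
    return sum(
        alpha * y * kernel(sv, x)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    ) - b


# Toy values assumed for illustration: two support vectors with equal
# multipliers, which corresponds to w = (0.5, 0.5) and b = 0.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]
b = 0.0

for x in [(1.0, 1.0), (-1.0, -1.0), (2.0, 0.0)]:
    value = decision_value(x, support_vectors, labels, alphas, b, linear_kernel)
    print(x, value, 1 if value > 0 else -1)
```

Swapping `linear_kernel` for any other kernel function changes the feature space without changing this evaluation code.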

60.7 Properties 

SVMs belong to a family of generalized linear classifiers
and can be interpreted as an extension of the perceptron.
They can also be considered a special case of Tikhonov
regularization. A special property is that they simultaneously
minimize the empirical classification error and
maximize the geometric margin; hence they are also
known as maximum margin classifiers.

A comparison of the SVM to other classifiers has been
made by Meyer, Leisch and Hornik. [8]

60.7.1 Parameter selection 

The effectiveness of SVM depends on the selection of
kernel, the kernel’s parameters, and soft margin parameter
$C$. A common choice is a Gaussian kernel,
which has a single parameter $\gamma$. The best combination
of $C$ and $\gamma$ is often selected by a grid search
with exponentially growing sequences of $C$ and $\gamma$,
for example, $C \in \{2^{-5}, 2^{-3}, \dots, 2^{13}, 2^{15}\}$;
$\gamma \in \{2^{-15}, 2^{-13}, \dots, 2^{1}, 2^{3}\}$. Typically, each combination
of parameter choices is checked using cross validation,
and the parameters with best cross-validation accuracy
are picked. Alternatively, recent work in Bayesian optimization
can be used to select $C$ and $\gamma$, often requiring
the evaluation of far fewer parameter combinations than
grid search. The final model, which is used for testing
and for classifying new data, is then trained on the whole
training set using the selected parameters.
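
The grid itself is easy to generate. In the sketch below, `cv_score` stands for a caller-supplied function that trains an SVM with the given parameters and returns its cross-validation accuracy; it is a hypothetical placeholder, not part of any particular library.

```python
from itertools import product


def exponential_grid(start_exp, stop_exp, step=2, base=2.0):
    """Values base**start_exp, base**(start_exp + step), ..., base**stop_exp."""
    return [base ** e for e in range(start_exp, stop_exp + 1, step)]


C_GRID = exponential_grid(-5, 15)       # 2^-5, 2^-3, ..., 2^13, 2^15
GAMMA_GRID = exponential_grid(-15, 3)   # 2^-15, 2^-13, ..., 2^1, 2^3


def grid_search(cv_score):
    """Return the (C, gamma) pair with the highest cross-validation score.

    cv_score is a hypothetical callable (C, gamma) -> validation accuracy.
    """
    return max(product(C_GRID, GAMMA_GRID), key=lambda pair: cv_score(*pair))


# Stand-in scoring function for demonstration only; a real one would
# train and validate a model for each parameter pair.
def fake_score(C, gamma):
    return -abs(C - 8.0) - abs(gamma - 0.125)


print(len(C_GRID) * len(GAMMA_GRID), "combinations")
print(grid_search(fake_score))
```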


60.7.2 Issues 

Potential drawbacks of the SVM are the following three 
aspects: 

• Uncalibrated class membership probabilities 

• The SVM is only directly applicable for two-class 
tasks. Therefore, algorithms that reduce the multi¬ 
class task to several binary problems have to be ap¬ 
plied; see the multi-class SVM section. 

• Parameters of a solved model are difficult to inter¬ 
pret. 

60.8 Extensions 

60.8.1 Multiclass SVM 

Multiclass SVM aims to assign labels to instances by using 
support vector machines, where the labels are drawn from 
a finite set of several elements. 

The dominant approach for doing so is to reduce the
single multiclass problem into multiple binary classification
problems. [10] Common methods for such reduction
include: [10][11]

• Building binary classifiers which distinguish between
(i) one of the labels and the rest (one-versus-all)
or (ii) between every pair of classes (one-versus-one).
Classification of new instances for the one-versus-all
case is done by a winner-takes-all strategy,
in which the classifier with the highest output
function assigns the class (it is important that the
output functions be calibrated to produce comparable
scores). For the one-versus-one approach, classification
is done by a max-wins voting strategy, in
which every classifier assigns the instance to one of
the two classes, then the vote for the assigned class is
increased by one vote, and finally the class with the
most votes determines the instance classification.

• Directed acyclic graph SVM (DAGSVM) [12]

• Error-correcting output codes [13]
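
The two voting strategies for combining binary classifiers can be sketched as follows. The binary classifiers are represented here by plain functions returning a real-valued score; the toy scorers based on distance to a class centre are assumptions made purely so the example runs, and stand in for trained SVM decision functions.

```python
from collections import Counter
from itertools import combinations


def one_versus_all(x, scorers):
    """Winner-takes-all: scorers maps each label to its decision function."""
    return max(scorers, key=lambda label: scorers[label](x))


def one_versus_one(x, pairwise, labels):
    """Max-wins voting: pairwise[(a, b)](x) > 0 votes for a, otherwise for b."""
    votes = Counter()
    for a, b in combinations(labels, 2):
        votes[a if pairwise[(a, b)](x) > 0 else b] += 1
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained classifiers on 1-D inputs.
centers = {"low": 0.0, "mid": 5.0, "high": 10.0}
labels = list(centers)
scorers = {name: (lambda x, c=c: -(x - c) ** 2) for name, c in centers.items()}
pairwise = {
    (a, b): (lambda x, a=a, b=b: scorers[a](x) - scorers[b](x))
    for a, b in combinations(labels, 2)
}

for x in (4.0, 9.0):
    print(x, one_versus_all(x, scorers), one_versus_one(x, pairwise, labels))
```

With three classes, one-versus-all needs three classifiers and one-versus-one needs three pairs; for $k$ classes the latter grows as $k(k-1)/2$.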

Crammer and Singer proposed a multiclass SVM method
which casts the multiclass classification problem into a
single optimization problem, rather than decomposing it
into multiple binary classification problems. [14] See also
Lee, Lin and Wahba. [15][16]

60.8.2 Transductive support vector ma¬ 
chines 

Transductive support vector machines extend SVMs 
in that they could also treat partially labeled data in 
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semi-supervised learning by following the principles of 
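transduction. Here, in addition to the training set $\mathcal{D}$, the
learner is also given a set

$$\mathcal{D}^\star = \{\mathbf{x}_i^\star \mid \mathbf{x}_i^\star \in \mathbb{R}^p\}_{i=1}^{k}$$

of test examples to be classified. Formally, a transductive
support vector machine is defined by the following primal
optimization problem: [17]

Minimize (in $\mathbf{w}, b, \mathbf{y}^\star$)

$$\frac{1}{2}\|\mathbf{w}\|^2$$

subject to (for any $i = 1, \dots, n$ and any $j = 1, \dots, k$)

$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1,$$

$$y_j^\star(\mathbf{w} \cdot \mathbf{x}_j^\star - b) \ge 1,$$

and

$$y_j^\star \in \{-1, 1\}.$$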

Transductive support vector machines were introduced by 
Vladimir N. Vapnik in 1998. 

60.8.3 Structured SVM 

SVMs have been generalized to structured SVMs, where 
the label space is structured and of possibly infinite size. 

60.8.4 Regression 

A version of SVM for regression was proposed in 1996
by Vladimir N. Vapnik, Harris Drucker, Christopher J.
C. Burges, Linda Kaufman and Alexander J. Smola. [18]
This method is called support vector regression (SVR).
The model produced by support vector classification (as
described above) depends only on a subset of the training
data, because the cost function for building the model
does not care about training points that lie beyond the
margin. Analogously, the model produced by SVR depends
only on a subset of the training data, because the
cost function for building the model ignores any training
data close to the model prediction. Another SVM version
known as least squares support vector machine (LS-SVM)
has been proposed by Suykens and Vandewalle. [19]

Training the original SVR means solving [20]

minimize $\frac{1}{2}\|\mathbf{w}\|^2$

subject to

$$y_i - \langle \mathbf{w}, \mathbf{x}_i \rangle - b \le \varepsilon$$

$$\langle \mathbf{w}, \mathbf{x}_i \rangle + b - y_i \le \varepsilon$$

where $\mathbf{x}_i$ is a training sample with target value $y_i$. The
inner product plus intercept $\langle \mathbf{w}, \mathbf{x}_i \rangle + b$ is the prediction
for that sample, and $\varepsilon$ is a free parameter that serves as
a threshold: all predictions have to be within an $\varepsilon$ range
of the true predictions. Slack variables are usually added
into the above to allow for errors and to allow approximation
in the case the above problem is infeasible.
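
The role of $\varepsilon$ can be illustrated with a few lines of code. The sketch below checks whether a given linear model keeps every training target inside the $\varepsilon$ tube, and computes the total slack that would be needed otherwise. The model parameters and data are invented for the example; no training is performed.

```python
def predict(w, b, x):
    """Inner product plus intercept: <w, x> + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def within_tube(w, b, samples, epsilon):
    """True when every |y_i - (<w, x_i> + b)| <= epsilon."""
    return all(abs(y - predict(w, b, x)) <= epsilon for x, y in samples)


def total_slack(w, b, samples, epsilon):
    """Sum of the deviations in excess of epsilon; zero inside the tube."""
    return sum(max(0.0, abs(y - predict(w, b, x)) - epsilon) for x, y in samples)


samples = [((0.0,), 1.0), ((1.0,), 3.1), ((2.0,), 4.9), ((3.0,), 7.5)]
w, b = (2.0,), 1.0   # hypothetical model: prediction = 2x + 1

for epsilon in (0.2, 0.5):
    print(epsilon, within_tube(w, b, samples, epsilon), total_slack(w, b, samples, epsilon))
```

With the smaller $\varepsilon$ the last sample falls outside the tube, so the hard constraints are infeasible for this model and slack is required.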


60.9 Interpreting SVM models 

The SVM algorithm has been widely applied in the biological
and other sciences. Permutation tests based on
SVM weights have been suggested as a mechanism for
interpretation of SVM models. [21][22] Support vector machine
weights have also been used to interpret SVM models
in the past. [23] Posthoc interpretation of support vector
machine models in order to identify features used by the
model to make predictions is a relatively new area of research
with special significance in the biological sciences.


60.10 Implementation 

The parameters of the maximum-margin hyperplane 
are derived by solving the optimization. There ex¬ 
ist several specialized algorithms for quickly solving 
the QP problem that arises from SVMs, mostly re¬ 
lying on heuristics for breaking the problem down 
into smaller, more-manageable chunks. A common
method is Platt’s sequential minimal optimization (SMO)
algorithm, which breaks the problem down into 2-dimensional
sub-problems that may be solved analytically,
eliminating the need for a numerical optimization
algorithm. [24]

Another approach is to use an interior point method
that uses Newton-like iterations to find a solution of the
Karush-Kuhn-Tucker conditions of the primal and dual
problems. [25] Instead of solving a sequence of broken
down problems, this approach directly solves the problem
altogether. To avoid solving a linear system involving
the large kernel matrix, a low rank approximation to the
matrix is often used in the kernel trick.

The special case of linear support vector machines can be
solved more efficiently by the same kind of algorithms used
to optimize its close cousin, logistic regression; this
class of algorithms includes sub-gradient descent (e.g.,
PEGASOS [26]) and coordinate descent (e.g.,
LIBLINEAR [27]). LIBLINEAR has some attractive training
time properties. Each convergence iteration takes time
linear in the time taken to read the training data, and the
iterations also have a Q-linear convergence property,
making the algorithm extremely fast.
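The sub-gradient approach for the linear case can be sketched in a few lines. The following is a minimal illustration in the style of PEGASOS, not the reference implementation: it assumes the L2-regularized hinge loss, a step size of 1/(λt), and a toy data set in which a constant feature of 1.0 stands in for the bias term. The names pegasos_train and predict, and all numeric values, are hypothetical.

```python
import random


def pegasos_train(points, labels, lam=0.01, n_steps=2000, seed=0):
    """Stochastic sub-gradient descent on the L2-regularized hinge loss."""
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    for t in range(1, n_steps + 1):
        i = rng.randrange(len(points))      # pick one training example
        x, y = points[i], labels[i]
        eta = 1.0 / (lam * t)               # decreasing step size (assumed schedule)
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        # Step from the regularizer: shrink the weights on every iteration.
        w = [(1.0 - eta * lam) * wj for wj in w]
        # Step from the hinge loss: only when the margin is violated.
        if margin < 1.0:
            w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w


def predict(w, x):
    """Classify x by the sign of the linear score w . x."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


# Illustrative, linearly separable toy data; the last feature is a constant bias input.
points = [(2.0, 1.0, 1.0), (3.0, 2.5, 1.0), (2.5, 3.0, 1.0),
          (-2.0, -1.0, 1.0), (-3.0, -2.0, 1.0), (-1.5, -2.5, 1.0)]
labels = [1, 1, 1, -1, -1, -1]

w = pegasos_train(points, labels)
predictions = [predict(w, x) for x in points]
```

Each iteration touches a single example, so the cost of one step does not depend on the size of the training set; this is the property that makes such methods attractive for large linear problems.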






The general kernel SVMs can also be solved more efficiently
using sub-gradient descent (e.g. P-packSVM [28]),
especially when parallelization is allowed.

Kernel SVMs are available in many machine learning
toolkits, including LIBSVM, MATLAB, SVMlight, kernlab,
scikit-learn, Shogun, Weka, Shark, JKernelMachines and
others.


60.11 Applications 

SVMs can be used to solve various real-world problems:

• SVMs are helpful in text and hypertext categorization, as
their application can significantly reduce the need for
labeled training instances in both the standard inductive
and transductive settings.

• Classification of images can also be performed using
SVMs. Experimental results show that SVMs achieve
significantly higher search accuracy than traditional query
refinement schemes after just three to four rounds of
relevance feedback.

• SVMs are also useful in medical science to classify
proteins, with up to 90% of the compounds classified
correctly.

• Hand-written characters can be recognized using SVMs.


60.12 See also 

• In situ adaptive tabulation 

• Kernel machines 

• Fisher kernel 

• Platt scaling 

• Polynomial kernel 

• Predictive analytics 

• Regularization perspectives on support vector machines

• Relevance vector machine, a probabilistic sparse 
kernel model identical in functional form to SVM 

• Sequential minimal optimization 

• Winnow (algorithm) 

Chapter 61 

Artificial neural network 




An artificial neural network is an interconnected group of nodes,
akin to the vast network of neurons in a brain. Here, each circular
node represents an artificial neuron and an arrow represents a
connection from the output of one neuron to the input of another.

In machine learning and cognitive science, artificial
neural networks (ANNs) are a family of statistical learning
models inspired by biological neural networks (the central
nervous systems of animals, in particular the brain) and
are used to estimate or approximate functions that can
depend on a large number of inputs and are generally
unknown. Artificial neural networks are generally presented
as systems of interconnected “neurons” which send messages
to each other. The connections have numeric weights that
can be tuned based on experience, making neural nets
adaptive to inputs and capable of learning.


For example, a neural network for handwriting recognition
is defined by a set of input neurons which may be activated
by the pixels of an input image. After being weighted and
transformed by a function (determined by the network’s
designer), the activations of these neurons are then passed
on to other neurons. This process is repeated until
finally, an output neuron is activated. This determines
which character was read.

Like other machine learning methods - systems that learn
from data - neural networks have been used to solve a wide
variety of tasks that are hard to solve using ordinary
rule-based programming, including computer vision and
speech recognition.

61.1 Background 

Examinations of humans’ central nervous systems inspired
the concept of artificial neural networks. In an artificial
neural network, simple artificial nodes, known as
“neurons”, “neurodes”, “processing elements” or “units”,
are connected together to form a network which mimics a
biological neural network.

There is no single formal definition of what an artificial
neural network is. However, a class of statistical models
may commonly be called “neural” if it possesses the
following characteristics:

1. it contains sets of adaptive weights, i.e. numerical
parameters that are tuned by a learning algorithm, and

2. it is capable of approximating non-linear functions of
its inputs.

The adaptive weights can be thought of as connection 
strengths between neurons, which are activated during 
training and prediction. 

Neural networks are similar to biological neural networks
in that the units perform functions collectively and in
parallel, rather than there being a clear delineation of
subtasks to which individual units are assigned. The term
“neural network” usually refers to models employed in
statistics, cognitive psychology and artificial
intelligence. Neural network models which emulate the
central nervous system are part of theoretical neuroscience
and computational neuroscience.

In modern software implementations of artificial neural
networks, the approach inspired by biology has been largely
abandoned for a more practical approach based on statistics
and signal processing. In some of these systems, neural
networks or parts of neural networks (like artificial
neurons) form components in larger systems that combine
both adaptive and non-adaptive elements. While the more
general approach of such systems is more suitable for
real-world problem solving, it has little to do with the
traditional artificial intelligence connectionist models.
What they do have in common, however, is the principle of
non-linear, distributed, parallel and local processing and
adaptation. Historically, the use of neural network models
marked a directional shift in the late 1980s from
high-level (symbolic) AI, characterized by expert systems
with knowledge embodied in if-then rules, to low-level
(sub-symbolic) machine learning, characterized by knowledge
embodied in the parameters of a dynamical system.

61.2 History 

Warren McCulloch and Walter Pitts [1] (1943) created a
computational model for neural networks based on
mathematics and algorithms called threshold logic. This
model paved the way for neural network research to split
into two distinct approaches. One approach focused on
biological processes in the brain and the other focused on
the application of neural networks to artificial
intelligence.

In the late 1940s psychologist Donald Hebb [2] created a
hypothesis of learning based on the mechanism of neural
plasticity that is now known as Hebbian learning. Hebbian
learning is considered to be a 'typical' unsupervised
learning rule and its later variants were early models for
long term potentiation. Researchers started applying these
ideas to computational models in 1948 with Turing’s B-type
machines.

Farley and Wesley A. Clark [3] (1954) first used
computational machines, then called “calculators,” to
simulate a Hebbian network at MIT. Other neural network
computational machines were created by Rochester, Holland,
Habit, and Duda [4] (1956).

Frank Rosenblatt [5] (1958) created the perceptron, an
algorithm for pattern recognition based on a two-layer
computer learning network using simple addition and
subtraction. With mathematical notation, Rosenblatt also
described circuitry not in the basic perceptron, such as
the exclusive-or circuit, a circuit whose mathematical
computation could not be processed until after the
backpropagation algorithm was created by Paul Werbos [6]
(1975).

Neural network research stagnated after the publication of
machine learning research by Marvin Minsky and Seymour
Papert [7] (1969), who discovered two key issues with the
computational machines that processed neural networks. The
first was that single-layer neural networks were incapable
of processing the exclusive-or circuit. The second
significant issue was that computers didn't have enough
processing power to effectively handle the long run time
required by large neural networks. Neural network research
slowed until computers achieved greater processing power.
Another key advance that came later was the backpropagation
algorithm which effectively solved the exclusive-or problem
(Werbos 1975). [6]

The parallel distributed processing of the mid-1980s became
popular under the name connectionism. The textbook by David
E. Rumelhart and James McClelland [8] (1986) provided a
full exposition of the use of connectionism in computers to
simulate neural processes.

Neural networks, as used in artificial intelligence, have
traditionally been viewed as simplified models of neural
processing in the brain, even though the relation between
this model and the biological architecture of the brain is
debated; it’s not clear to what degree artificial neural
networks mirror brain function. [9]

Support vector machines and other, much simpler methods
such as linear classifiers gradually overtook neural
networks in machine learning popularity. But the advent of
deep learning in the late 2000s sparked renewed interest in
neural nets.


61.2.1 Improvements since 2006 

Computational devices have been created in CMOS, for both
biophysical simulation and neuromorphic computing. More
recent efforts show promise for creating nanodevices [10]
for very large scale principal components analyses and
convolution. If successful, these efforts could usher in a
new era of neural computing [11] that is a step beyond
digital computing, because it depends on learning rather
than programming and because it is fundamentally analog
rather than digital, even though the first instantiations
may in fact be with CMOS digital devices.

Between 2009 and 2012, the recurrent neural networks and
deep feedforward neural networks developed in the research
group of Jürgen Schmidhuber at the Swiss AI Lab IDSIA won
eight international competitions in pattern recognition and
machine learning. [12][13] For example, the bi-directional
and multi-dimensional long short term memory
(LSTM) [14][15][16][17] of Alex Graves et al. won three
competitions in connected handwriting recognition at the
2009 International Conference on Document Analysis and
Recognition (ICDAR), without any prior knowledge about the
three different languages to be learned.

Fast GPU-based implementations of this approach by Dan
Ciresan and colleagues at IDSIA have won several pattern
recognition contests, including the IJCNN 2011 Traffic Sign
Recognition Competition, [18][19] the ISBI 2012
Segmentation of Neuronal Structures in Electron Microscopy
Stacks challenge, [20] and others. Their neural networks
were also the first artificial pattern recognizers to
achieve human-competitive or even superhuman
performance [21] on important benchmarks such as traffic
sign recognition (IJCNN 2012), or the MNIST handwritten
digits problem of Yann LeCun at NYU.

Deep, highly nonlinear neural architectures similar to the
1980 neocognitron by Kunihiko Fukushima [22] and the
“standard architecture of vision”, [23] inspired by the
simple and complex cells identified by David H. Hubel and
Torsten Wiesel in the primary visual cortex, can also be
pre-trained by unsupervised methods [24][25] of Geoff
Hinton's lab at University of Toronto. [26][27] A team from
this lab won a 2012 contest sponsored by Merck to design
software to help find molecules that might lead to new
drugs. [28]

61.3 Models 

Neural network models in artificial intelligence are
usually referred to as artificial neural networks (ANNs);
these are essentially simple mathematical models defining a
function $f : X \to Y$ or a distribution over $X$ or both
$X$ and $Y$, but sometimes models are also intimately
associated with a particular learning algorithm or learning
rule. A common use of the phrase “ANN model” is really the
definition of a class of such functions (where members of
the class are obtained by varying parameters, connection
weights, or specifics of the architecture such as the
number of neurons or their connectivity).

61.3.1 Network function 

See also: Graphical models 

The word network in the term 'artificial neural network' 
refers to the inter-connections between the neurons in the 
different layers of each system. An example system has 
three layers. The first layer has input neurons which send 
data via synapses to the second layer of neurons, and then 
via more synapses to the third layer of output neurons. 
More complex systems will have more layers of neurons, 
some having increased layers of input neurons and output 
neurons. The synapses store parameters called “weights” 
that manipulate the data in the calculations. 

An ANN is typically defined by three types of parameters:

1. The interconnection pattern between the different
layers of neurons

2. The learning process for updating the weights of the
interconnections

3. The activation function that converts a neuron’s
weighted input to its output activation.

Mathematically, a neuron’s network function $f(x)$ is
defined as a composition of other functions $g_i(x)$, which
can further be defined as a composition of other functions.
This can be conveniently represented as a network
structure, with arrows depicting the dependencies between
variables. A widely used type of composition is the
nonlinear weighted sum, where
$f(x) = K\left(\sum_i w_i g_i(x)\right)$, where $K$
(commonly referred to as the activation function [29]) is
some predefined function, such as the hyperbolic tangent.
It will be convenient for the following to refer to a
collection of functions $g_i$ as simply a vector
$g = (g_1, g_2, \ldots, g_n)$.
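The nonlinear weighted sum can be written directly as a short function. This is a minimal sketch that takes the hyperbolic tangent as the activation function K, as suggested in the text; the function name neuron and the numeric weights and inputs are illustrative assumptions.

```python
import math


def neuron(weights, g_values, activation=math.tanh):
    """Return K(sum_i w_i * g_i(x)), given the already evaluated values g_i(x)."""
    weighted_sum = sum(w * g for w, g in zip(weights, g_values))
    return activation(weighted_sum)


# Hypothetical weights w_i and inputs g_i(x); the weighted sum here is 0.5.
output = neuron([0.5, -1.0, 2.0], [1.0, 0.5, 0.25])
```

Because the values g_i(x) may themselves be outputs of other neurons, chaining calls to such a function reproduces the composition of functions described above.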



ANN dependency graph 

This figure depicts such a decomposition of $f$, with
dependencies between variables indicated by arrows. These
can be interpreted in two ways.

The first view is the functional view: the input $x$ is
transformed into a 3-dimensional vector $h$, which is then
transformed into a 2-dimensional vector $g$, which is
finally transformed into $f$. This view is most commonly
encountered in the context of optimization.

The second view is the probabilistic view: the random
variable $F = f(G)$ depends upon the random variable
$G = g(H)$, which depends upon $H = h(X)$, which depends
upon the random variable $X$. This view is most commonly
encountered in the context of graphical models.

The two views are largely equivalent. In either case, for
this particular network architecture, the components of
individual layers are independent of each other (e.g., the
components of $g$ are independent of each other given their
input $h$). This naturally enables a degree of parallelism
in the implementation.
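The functional view can be made concrete with a small sketch. It assumes a scalar input x, tanh activations in every layer and arbitrarily chosen weight matrices; only the layer sizes (3, then 2, then a single output) come from the description above, and the names layer, W_h, W_g and W_f are hypothetical.

```python
import math


def layer(weights, inputs):
    """Apply one layer: each row of `weights` defines one tanh unit.

    Every unit depends only on `inputs`, not on the other units of the
    same layer, so the units could be evaluated in parallel.
    """
    return [math.tanh(sum(w * v for w, v in zip(row, inputs))) for row in weights]


# Illustrative weights, chosen only to produce the shapes x -> h (3) -> g (2) -> f.
W_h = [[0.5], [-1.0], [2.0]]
W_g = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
W_f = [[1.0, -1.0]]


def f(x):
    """Compose the three layers: f = f_out(g(h(x)))."""
    h = layer(W_h, [x])      # 3-dimensional vector h
    g = layer(W_g, h)        # 2-dimensional vector g
    return layer(W_f, g)[0]  # scalar output f


h_example = layer(W_h, [0.3])
g_example = layer(W_g, h_example)
```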

Networks such as the previous one are commonly called
feedforward, because their graph is a directed acyclic
graph. Networks with cycles are commonly called recurrent.
Such networks are commonly depicted in the manner shown at
the top of the figure, where $f$ is shown as being
dependent upon itself. However, an implied temporal
dependence is not shown.

Two separate depictions of the recurrent ANN dependency graph

61.3.2 Learning 

What has attracted the most interest in neural networks is
the possibility of learning. Given a specific task to solve,
and a class of functions $F$, learning means using a set of
observations to find $f^* \in F$ which solves the task in
some optimal sense.

This entails defining a cost function
$C : F \to \mathbb{R}$ such that, for the optimal solution
$f^*$, $C(f^*) \leq C(f)$ $\forall f \in F$, i.e., no
solution has a cost less than the cost of the optimal
solution (see Mathematical optimization).

The cost function C is an important concept in learning, 
as it is a measure of how far away a particular solution 
is from an optimal solution to the problem to be solved. 
Learning algorithms search through the solution space to 
find a function that has the smallest possible cost. 

For applications where the solution is dependent on some
data, the cost must necessarily be a function of the
observations, otherwise we would not be modelling anything
related to the data. It is frequently defined as a statistic
to which only approximations can be made. As a simple
example, consider the problem of finding the model $f$
which minimizes $C = E\left[(f(x) - y)^2\right]$, for data
pairs $(x, y)$ drawn from some distribution $\mathcal{D}$.
In practical situations we would only have $N$ samples from
$\mathcal{D}$ and thus, for the above example, we would
only minimize
$\hat{C} = \frac{1}{N}\sum_{i=1}^{N} (f(x_i) - y_i)^2$.
Thus, the cost is minimized over a sample of the data
rather than the entire distribution generating the data.
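As a small illustration of minimizing the cost over a sample, the following sketch evaluates the sample average of the squared error for two candidate models. The data set, the two models and the function name empirical_cost are assumptions made for the example.

```python
def empirical_cost(f, samples):
    """Average squared error of model f over a finite list of (x, y) pairs."""
    return sum((f(x) - y) ** 2 for x, y in samples) / len(samples)


# Hypothetical noise-free sample from y = 2x + 1 (illustrative data only).
samples = [(float(x), 2.0 * x + 1.0) for x in range(5)]


def good_model(x):
    return 2.0 * x + 1.0


def poor_model(x):
    return 2.0 * x  # misses the constant offset


cost_good = empirical_cost(good_model, samples)
cost_poor = empirical_cost(poor_model, samples)
```

A learning algorithm searching over this class of models would prefer good_model, since it attains the smaller empirical cost on the sample.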


When $N \to \infty$ some form of online machine learning
must be used, where the cost is partially minimized as each
new example is seen. While online machine learning is often
used when $\mathcal{D}$ is fixed, it is most useful in the
case where the distribution changes slowly over time. In
neural network methods, some form of online machine
learning is frequently used for finite datasets.
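The idea of partially minimizing the cost as each example arrives can be sketched for a one-dimensional linear model. The model form w*x + b, the learning rate and the repeated toy stream are assumptions for illustration, not part of the source.

```python
def online_update(w, b, x, y, lr=0.05):
    """One gradient step on the squared error (w*x + b - y)**2 of a single example."""
    err = (w * x + b) - y
    return w - lr * 2.0 * err * x, b - lr * 2.0 * err


# Hypothetical stream: a finite data set from y = 2x + 1, presented repeatedly.
stream = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)] * 200

w, b = 0.0, 0.0
for x, y in stream:
    # The parameters are adjusted immediately after each example is seen.
    w, b = online_update(w, b, x, y)
```

Cycling through a finite data set in this way corresponds to the use of online learning for finite datasets mentioned above.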

See also: Mathematical optimization, Estimation theory
and Machine learning


Choosing a cost function 

While it is possible to define some arbitrary ad hoc cost
function, frequently a particular cost will be used, either
because it has desirable properties (such as convexity) or
because it arises naturally from a particular formulation
of the problem (e.g., in a probabilistic formulation the
posterior probability of the model can be used as an
inverse cost). Ultimately, the cost function will depend on
the desired task. An overview of the three main categories
of learning tasks is provided below:

61.3.3 Learning paradigms 

There are three major learning paradigms, each
corresponding to a particular abstract learning task. These
are supervised learning, unsupervised learning and
reinforcement learning.

Supervised learning 

In supervised learning, we are given a set of example pairs
$(x, y)$, $x \in X$, $y \in Y$, and the aim is to find a
function $f : X \to Y$ in the allowed class of functions
that matches the examples. In other words, we wish to infer
the mapping implied by the data; the cost function is
related to the mismatch between our mapping and the data,
and it implicitly contains prior knowledge about the
problem domain.

A commonly used cost is the mean-squared error, which tries
to minimize the average squared error between the network’s
output, $f(x)$, and the target value $y$ over all the
example pairs. When one tries to minimize this cost using
gradient descent for the class of neural networks called
multilayer perceptrons, one obtains the common and
well-known backpropagation algorithm for training neural
networks.
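A compact sketch of this procedure is given below: gradient descent on the mean-squared error of a small multilayer perceptron, with the gradients obtained by the chain rule. The network size (2 inputs, 3 hidden units, 1 output), the sigmoid activations, the learning rate, the random initialization and the exclusive-or training set are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Return hidden activations and network output for input x."""
    W1, b1, W2, b2 = params
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    out = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, out


def mse(params, data):
    return sum((forward(params, x)[1] - y) ** 2 for x, y in data) / len(data)


def train_step(params, data, lr):
    """One full-batch gradient descent step, gradients via backpropagation."""
    W1, b1, W2, b2 = params
    n = len(data)
    gW1 = [[0.0] * len(row) for row in W1]
    gb1 = [0.0] * len(b1)
    gW2 = [0.0] * len(W2)
    gb2 = 0.0
    for x, y in data:
        h, out = forward(params, x)
        # Error signal at the output unit (derivative w.r.t. its pre-activation).
        d_out = 2.0 * (out - y) / n * out * (1.0 - out)
        gb2 += d_out
        for j, hj in enumerate(h):
            gW2[j] += d_out * hj
            # Error signal propagated back to hidden unit j.
            d_h = d_out * W2[j] * hj * (1.0 - hj)
            gb1[j] += d_h
            for i, xi in enumerate(x):
                gW1[j][i] += d_h * xi
    W1 = [[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(W1, gW1)]
    b1 = [b - lr * g for b, g in zip(b1, gb1)]
    W2 = [w - lr * g for w, g in zip(W2, gW2)]
    return W1, b1, W2, b2 - lr * gb2


rng = random.Random(0)
params = ([[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)], [0.0] * 3,
          [rng.uniform(-1, 1) for _ in range(3)], 0.0)
xor_data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

initial_cost = mse(params, xor_data)
for _ in range(2000):
    params = train_step(params, xor_data, lr=0.5)
final_cost = mse(params, xor_data)
```

The sketch only demonstrates that the cost is driven down by the updates; how closely the trained network fits the examples depends on the initialization and the number of steps.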

Tasks that fall within the paradigm of supervised learning
are pattern recognition (also known as classification) and
regression (also known as function approximation). The
supervised learning paradigm is also applicable to
sequential data (e.g., for speech and gesture recognition).
This can be thought of as learning with a “teacher”, in the
form of a function that provides continuous feedback on the
quality of solutions obtained thus far.

Unsupervised learning 

In unsupervised learning, some data $x$ is given, together
with a cost function to be minimized, which can be any
function of the data $x$ and the network’s output, $f$.

The cost function is dependent on the task (what we are 
trying to model) and our a priori assumptions (the implicit 
properties of our model, its parameters and the observed 
variables). 

As a trivial example, consider the model $f(x) = a$ where
$a$ is a constant and the cost $C = E[(x - f(x))^2]$.
Minimizing this cost will give us a value of $a$ that is
equal to the mean of the data. The cost function can be
much more complicated. Its form depends on the application:
for example, in compression it could be related to the
mutual information between $x$ and $f(x)$, whereas in
statistical modeling, it could be related to the posterior
probability of the model given the data (note that in both
of those examples those quantities would be maximized
rather than minimized).
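For the constant model, the claim that the minimizer is the mean follows from a short calculation, sketched here under the assumption that the expectation of $x$ and of $x^2$ exist:

```latex
\begin{align*}
C(a) &= E\left[(x - a)^2\right] = E[x^2] - 2a\,E[x] + a^2, \\
\frac{dC}{da} &= -2\,E[x] + 2a = 0 \quad\Longrightarrow\quad a = E[x], \\
\frac{d^2C}{da^2} &= 2 > 0 .
\end{align*}
```

The positive second derivative confirms that $a = E[x]$ is a minimum rather than a maximum of the cost.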

Tasks that fall within the paradigm of unsupervised
learning are in general estimation problems; the
applications include clustering, the estimation of
statistical distributions, compression and filtering.

Reinforcement learning 

In reinforcement learning, data $x$ are usually not given,
but generated by an agent’s interactions with the
environment. At each point in time $t$, the agent performs
an action $y_t$ and the environment generates an
observation $x_t$ and an instantaneous cost $c_t$,
according to some (usually unknown) dynamics. The aim is to
discover a policy for selecting actions that minimizes some
measure of a long-term cost, e.g., the expected cumulative
cost. The environment’s dynamics and the long-term cost for
each policy are usually unknown, but can be estimated.

More formally, the environment is modeled as a Markov
decision process (MDP) with states $s_1, \ldots, s_n \in S$
and actions $a_1, \ldots, a_m \in A$ with the following
probability distributions: the instantaneous cost
distribution $P(c_t \mid s_t)$, the observation
distribution $P(x_t \mid s_t)$ and the transition
$P(s_{t+1} \mid s_t, a_t)$, while a policy is defined as
the conditional distribution over actions given the
observations. Taken together, the two then define a Markov
chain (MC). The aim is to discover the policy (i.e., the
MC) that minimizes the cost.
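How a policy and an MDP together define a Markov chain, and how policies can be compared by long-term cost, can be sketched on a tiny example. Everything specific here is an assumption for illustration: the two-state, two-action MDP and its numbers, full observability (the observation equals the state), a fixed expected cost per state, and the use of a discount factor gamma to keep the cumulative cost finite.

```python
# Hypothetical MDP: P[s][a] lists the probabilities of moving to states 0 and 1.
P = {
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.5, 0.5], 1: [0.1, 0.9]},
}
cost = [1.0, 0.0]  # expected instantaneous cost in each state (illustrative)


def induced_chain(policy):
    """Combine policy[s][a] = P(a | s) with the MDP into Markov-chain transitions."""
    n = len(P)
    return [[sum(policy[s][a] * P[s][a][s2] for a in P[s]) for s2 in range(n)]
            for s in range(n)]


def discounted_cost(policy, gamma=0.9, sweeps=500):
    """Iterate J <- c + gamma * T J to approximate the expected discounted cost."""
    T = induced_chain(policy)
    n = len(T)
    J = [0.0] * n
    for _ in range(sweeps):
        J = [cost[s] + gamma * sum(T[s][s2] * J[s2] for s2 in range(n))
             for s in range(n)]
    return J


always_0 = [[1.0, 0.0], [1.0, 0.0]]  # always choose action 0
always_1 = [[0.0, 1.0], [0.0, 1.0]]  # always choose action 1

J_0 = discounted_cost(always_0)
J_1 = discounted_cost(always_1)
```

In this example the second policy steers the chain towards the cost-free state and therefore has the lower long-term cost from either starting state. A reinforcement learning method would have to discover this without being given P and cost explicitly.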

ANNs are frequently used in reinforcement learning as part
of the overall algorithm. [30][31] Dynamic programming has
been coupled with ANNs (neuro-dynamic programming) by
Bertsekas and Tsitsiklis [32] and applied to
multi-dimensional nonlinear problems such as those involved
in vehicle routing, [33] natural resources
management [34][35] or medicine [36] because of the ability
of ANNs to mitigate losses of accuracy even when reducing
the discretization grid density for numerically
approximating the solution of the original control
problems.

Tasks that fall within the paradigm of reinforcement 
learning are control problems, games and other sequential 
decision making tasks. 

See also: dynamic programming and stochastic control 


61.3.4 Learning algorithms 

Training a neural network model essentially means selecting
one model from the set of allowed models (or, in a Bayesian
framework, determining a distribution over the set of
allowed models) that minimizes the cost criterion. There
are numerous algorithms available for training neural
network models; most of them can be viewed as a
straightforward application of optimization theory and
statistical estimation.

Most of the algorithms used in training artificial neural
networks employ some form of gradient descent, using
backpropagation to compute the actual gradients. This is
done by simply taking the derivative of the cost function
with respect to the network parameters and then changing
those parameters in a gradient-related direction. The
backpropagation training algorithms are usually classified
into three categories: steepest descent (with variable
learning rate, with variable learning rate and momentum,
resilient backpropagation), quasi-Newton
(Broyden-Fletcher-Goldfarb-Shanno, one step secant,
Levenberg-Marquardt) and conjugate gradient
(Fletcher-Reeves update, Polak-Ribiere update, Powell-Beale
restart, scaled conjugate gradient). [37]
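The steepest descent family can be illustrated with a short sketch of gradient descent with momentum. For brevity the gradient is supplied by hand for a simple quadratic cost rather than computed by backpropagation, and the learning rate is kept fixed; the cost function, the parameter values and the function names are illustrative assumptions.

```python
def gradient_descent_momentum(grad, theta, lr=0.1, momentum=0.9, n_steps=500):
    """Move the parameters along a running, decayed sum of negative gradients."""
    velocity = [0.0] * len(theta)
    for _ in range(n_steps):
        g = grad(theta)
        # The velocity keeps part of the previous step and adds the new gradient step.
        velocity = [momentum * v - lr * gi for v, gi in zip(velocity, g)]
        theta = [t + v for t, v in zip(theta, velocity)]
    return theta


def grad(theta):
    """Gradient of the hypothetical cost (theta0 - 3)^2 + 10 * (theta1 + 1)^2."""
    return [2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)]


theta = gradient_descent_momentum(grad, [0.0, 0.0])
```

The momentum term smooths successive steps, which helps on costs such as this one where the curvature differs strongly between parameters.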

Evolutionary methods, [38] gene expression
programming, [39] simulated annealing, [40]
expectation-maximization, non-parametric methods and
particle swarm optimization [41] are some commonly used
methods for training neural networks.

See also: machine learning 

61.4 Employing artificial neural networks

Perhaps the greatest advantage of ANNs is their ability to
be used as an arbitrary function approximation mechanism
that “learns” from observed data. However, using them is
not so straightforward, and a relatively good understanding
of the underlying theory is essential.

• Choice of model: This will depend on the data
representation and the application. Overly complex models
tend to lead to problems with learning.

• Learning algorithm: There are numerous trade-offs between
learning algorithms. Almost any algorithm will work well
with the correct hyperparameters for training on a
particular fixed data set. However, selecting and tuning an
algorithm for training on unseen data requires a
significant amount of experimentation.

• Robustness: If the model, cost function and learning
algorithm are selected appropriately, the resulting ANN can
be extremely robust.

With the correct implementation, ANNs can be used naturally
in online learning and large data set applications. Their
simple implementation and the existence of mostly local
dependencies exhibited in the structure allow for fast,
parallel implementations in hardware.

61.5 Applications 

The utility of artificial neural network models lies in the
fact that they can be used to infer a function from
observations. This is particularly useful in applications
where the complexity of the data or task makes the design
of such a function by hand impractical.

61.5.1 Real-life applications 

The tasks artificial neural networks are applied to tend to 
fall within the following broad categories: 

• Function approximation, or regression analysis, including
time series prediction, fitness approximation and modeling.

• Classification, including pattern and sequence
recognition, novelty detection and sequential decision
making.

• Data processing, including filtering, clustering, blind
source separation and compression.

• Robotics, including directing manipulators and
prostheses.

• Control, including computer numerical control.

Application areas include system identification and control
(vehicle control, process control, natural resources
management), quantum chemistry, [42] game-playing and
decision making (backgammon, chess, poker), pattern
recognition (radar systems, face identification, object
recognition and more), sequence recognition (gesture,
speech, handwritten text recognition), medical diagnosis,
financial applications (e.g. automated trading systems),
data mining (or knowledge discovery in databases, “KDD”),
visualization and e-mail spam filtering.

Artificial neural networks have also been used to diagnose
several cancers. An ANN-based hybrid lung cancer detection
system named HLND improves the accuracy of diagnosis and
the speed of lung cancer radiology. [43] These networks
have also been used to diagnose prostate cancer. The
diagnoses can be used to make specific models taken from a
large group of patients compared to information of one
given patient. The models do not depend on assumptions
about correlations of different variables. Colorectal
cancer has also been predicted using neural networks.
Neural networks could predict the outcome for a patient
with colorectal cancer with more accuracy than the current
clinical methods. After training, the networks could
predict multiple patient outcomes from unrelated
institutions. [44]

61.5.2 Neural networks and neuroscience 

Theoretical and computational neuroscience is the field
concerned with the theoretical analysis and the
computational modeling of biological neural systems. Since
neural systems are intimately related to cognitive processes
and behavior, the field is closely related to cognitive and
behavioral modeling.

The aim of the field is to create models of biological
neural systems in order to understand how biological systems
work. To gain this understanding, neuroscientists strive
to make a link between observed biological processes
(data), biologically plausible mechanisms for neural
processing and learning (biological neural network models)
and theory (statistical learning theory and information
theory).

Types of models 

Many models are used in the field, defined at different
levels of abstraction and modeling different aspects of
neural systems. They range from models of the short-term
behavior of individual neurons (e.g. [45]), to models
of how the dynamics of neural circuitry arise from
interactions between individual neurons, and finally to
models of how behavior can arise from abstract neural
modules that represent complete subsystems. These include
models of the long-term and short-term plasticity of neural
systems and their relations to learning and memory from
the individual neuron to the system level.

Memory networks 

Integrating external memory components with artificial
neural networks has a long history dating back to
early research in distributed representations [46] and
self-organizing maps. For example, in sparse distributed
memory the patterns encoded by neural networks are used as
memory addresses for content-addressable memory, with
“neurons” essentially serving as address encoders and
decoders.

More recently, deep learning was shown to be useful in
semantic hashing,[47] where a deep graphical model is fit
to the word-count vectors [48] obtained from a large set of
documents. Documents are mapped to memory addresses in
such a way that semantically similar documents are
located at nearby addresses. Documents similar to a query
document can then be found by simply accessing all the
addresses that differ by only a few bits from the address
of the query document.
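
The lookup step can be sketched in Python as follows. This is a minimal illustration, not the cited method: the 8-bit addresses are hand-picked rather than learned by a deep model, and the names `hamming_ball`, `similar_documents` and `memory` are hypothetical.

```python
from itertools import combinations

def hamming_ball(address, n_bits, radius):
    """Yield every n_bits-wide address within `radius` bit flips of `address`."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            neighbour = address
            for p in positions:
                neighbour ^= 1 << p          # flip bit p
            yield neighbour

def similar_documents(memory, query_address, n_bits, radius=2):
    """Collect the documents stored at addresses near the query address."""
    found = []
    for addr in hamming_ball(query_address, n_bits, radius):
        found.extend(memory.get(addr, []))
    return found

# Toy memory: hypothetical 8-bit codes, assumed here rather than learned.
memory = {
    0b10110010: ["doc_a"],
    0b10110011: ["doc_b"],   # differs from doc_a in 1 bit
    0b10010011: ["doc_c"],   # differs from doc_a in 2 bits
    0b01001101: ["doc_d"],   # differs from doc_a in all 8 bits
}
hits = similar_documents(memory, 0b10110010, n_bits=8, radius=2)
```

With a radius of 2, the query address of doc_a retrieves doc_a, doc_b and doc_c but not doc_d. The number of addresses probed grows quickly with the radius, which is why only "a few bits" of difference are practical.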

Neural Turing Machines,[49] developed by Google
DeepMind, extend the capabilities of deep neural networks by
coupling them to external memory resources, which they
can interact with by attentional processes. The combined
system is analogous to a Turing Machine but is
differentiable end-to-end, allowing it to be efficiently
trained with gradient descent. Preliminary results
demonstrate that Neural Turing Machines can infer simple
algorithms such as copying, sorting, and associative recall
from input and output examples.

Memory Networks [50] is another extension to neural
networks incorporating long-term memory, which was
developed by Facebook research. The long-term memory can
be read and written to, with the goal of using it for
prediction. These models have been applied in the context
of question answering (QA), where the long-term memory
effectively acts as a (dynamic) knowledge base, and
the output is a textual response.

61.6 Neural network software 

Main article: Neural network software 

Neural network software is used to simulate, research, 
develop and apply artificial neural networks, biological 
neural networks and, in some cases, a wider array of 
adaptive systems. 

61.7 Types of artificial neural networks

Main article: Types of artificial neural networks 

Artificial neural network types vary from those with only
one or two layers of single-direction logic to complicated
multi-input, many-directional feedback loops and
layers. On the whole, these systems use algorithms in
their programming to determine control and organization
of their functions. Most systems use “weights” to change
the parameters of the throughput and the varying
connections to the neurons. Artificial neural networks can be
autonomous and learn by input from outside “teachers” or
even self-teaching from written-in rules.

61.8 Theoretical properties 

61.8.1 Computational power 

The multi-layer perceptron (MLP) is a universal function
approximator, as proven by the universal approximation
theorem. However, the proof is not constructive
regarding the number of neurons required or the settings of
the weights.

Work by Hava Siegelmann and Eduardo D. Sontag has
provided a proof that a specific recurrent architecture
with rational-valued weights (as opposed to full
precision real number-valued weights) has the full power of
a Universal Turing Machine [51] using a finite number of
neurons and standard linear connections. Further, it has
been shown that the use of irrational values for weights
results in a machine with super-Turing power.[52]

61.8.2 Capacity 

Artificial neural network models have a property called 
'capacity', which roughly corresponds to their ability to 
model any given function. It is related to the amount of 
information that can be stored in the network and to the 
notion of complexity. 

61.8.3 Convergence 

Nothing can be said in general about convergence since it 
depends on a number of factors. Firstly, there may exist 
many local minima. This depends on the cost function 
and the model. Secondly, the optimization method used 
might not be guaranteed to converge when far away from 
a local minimum. Thirdly, for a very large amount of 
data or parameters, some methods become impractical. 
In general, it has been found that theoretical guarantees 
regarding convergence are an unreliable guide to practical 
application. 

61.8.4 Generalization and statistics 

In applications where the goal is to create a system that
generalizes well to unseen examples, the problem of
overtraining has emerged. This arises in convoluted or
over-specified systems when the capacity of the network
significantly exceeds the needed free parameters. There
are two schools of thought for avoiding this problem.
The first is to use cross-validation and similar techniques
to check for the presence of overtraining and to select
hyperparameters so as to minimize the generalization
error. The second is to use some form of
regularization. This is a concept that emerges naturally in
a probabilistic (Bayesian) framework, where the
regularization can be performed by selecting a larger prior
probability over simpler models; but also in statistical
learning theory, where the goal is to minimize over two
quantities: the 'empirical risk' and the 'structural risk',
which roughly correspond to the error over the training set
and the predicted error in unseen data due to overfitting.
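
Both ideas can be combined in a short Python sketch. To keep it small, a linear model with an L2 penalty (ridge regression) stands in for a neural network; that substitution, the synthetic data and the candidate penalty values are all assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: minimise ||Xw - y||^2 + lam * ||w||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def cv_error(X, y, lam, k=5):
    """Mean squared validation error of ridge regression over k folds."""
    indices = np.arange(len(y))
    errors = []
    for held_out in np.array_split(indices, k):
        train = np.setdiff1d(indices, held_out)
        w = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((X[held_out] @ w - y[held_out]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 30))            # few examples, many free parameters
true_w = np.zeros(30)
true_w[:3] = [2.0, -1.0, 0.5]            # only three inputs actually matter
y = X @ true_w + rng.normal(scale=0.5, size=40)

# The regularization strength is the hyperparameter chosen by cross-validation.
candidates = [1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0]
scores = {lam: cv_error(X, y, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)
```

Here `best_lam` is simply the candidate with the lowest estimated generalization error; the same selection loop applies to other hyperparameters, such as the number of hidden units in a network.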



Confidence analysis of a neural network 


Supervised neural networks that use a mean squared error
(MSE) cost function can use formal statistical methods to
determine the confidence of the trained model. The MSE
on a validation set can be used as an estimate for variance.
This value can then be used to calculate the confidence
interval of the output of the network, assuming a normal
distribution. A confidence analysis made this way is
statistically valid as long as the output probability
distribution stays the same and the network is not modified.
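
A minimal Python sketch of this calculation follows. The function name, the toy validation data and the choice of z = 1.96 (roughly 95% coverage under a normal distribution) are assumptions for illustration.

```python
import numpy as np

def confidence_interval(prediction, validation_targets, validation_outputs, z=1.96):
    """Interval around `prediction`, using the validation MSE as the variance."""
    residuals = np.asarray(validation_targets) - np.asarray(validation_outputs)
    mse = np.mean(residuals ** 2)        # estimate of the error variance
    sigma = np.sqrt(mse)                 # corresponding standard deviation
    return prediction - z * sigma, prediction + z * sigma

# Toy validation set in which every network output is off by 0.5.
targets = np.array([1.0, 2.0, 3.0, 4.0])
outputs = np.array([1.5, 1.5, 3.5, 3.5])
low, high = confidence_interval(2.0, targets, outputs)
```

For this data the MSE is 0.25, so the standard deviation is 0.5 and a network output of 2.0 is reported with the interval from 1.02 to 2.98.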

By assigning a softmax activation function, a
generalization of the logistic function, on the output layer
of the neural network (or a softmax component in a
component-based neural network) for categorical target
variables, the outputs can be interpreted as posterior
probabilities. This is very useful in classification as it
gives a certainty measure on classifications.

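The softmax activation function is:

$$y_i = \frac{e^{x_i}}{\sum_{j=1}^{c} e^{x_j}}$$

where $x_i$ is the input to output unit $i$ and $c$ is the number of categories.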
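
As a hedged illustration in Python, each softmax output is the exponential of one input divided by the sum of the exponentials of all inputs. Subtracting the maximum before exponentiating is a common numerical safeguard added here; it does not change the result.

```python
import numpy as np

def softmax(x):
    """Map a vector of real scores to positive values that sum to 1."""
    shifted = x - np.max(x)      # avoids overflow in exp for large scores
    exps = np.exp(shifted)
    return exps / np.sum(exps)

probs = softmax(np.array([2.0, 1.0, 0.1]))
```

The largest score receives the largest probability, and the outputs can be read as a certainty measure over the categories.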


61.9 Controversies 

61.9.1 Training issues 

A common criticism of neural networks, particularly in
robotics, is that they require a large diversity of training
for real-world operation. This is not surprising, since
any learning machine needs sufficient representative
examples in order to capture the underlying structure that
allows it to generalize to new cases. Dean Pomerleau,
in his research presented in the paper “Knowledge-based
Training of Artificial Neural Networks for Autonomous
Robot Driving,” uses a neural network to train a robotic
vehicle to drive on multiple types of roads (single lane,
multi-lane, dirt, etc.). A large amount of his research
is devoted to (1) extrapolating multiple training
scenarios from a single training experience, and (2) preserving
past training diversity so that the system does not become
overtrained (if, for example, it is presented with a series
of right turns, it should not learn to always turn right).
These issues are common in neural networks that must
decide from amongst a wide variety of responses, but can be
dealt with in several ways, for example by randomly
shuffling the training examples, by using a numerical
optimization algorithm that does not take too large steps when
changing the network connections following an example,
or by grouping examples in so-called mini-batches.
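
The three remedies (shuffling, small steps and mini-batches) can be sketched together in Python. A linear model replaces the neural network to keep the example short; the learning rate, batch size and synthetic data are assumptions, not values from Pomerleau's work.

```python
import numpy as np

def sgd_minibatch(X, y, lr=0.05, batch_size=8, epochs=200, seed=0):
    """Fit y ~ X @ w by mini-batch gradient descent on squared error."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(y))            # reshuffle every epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]  # one mini-batch
            error = X[idx] @ w - y[idx]
            grad = X[idx].T @ error / len(idx)     # average over the batch
            w -= lr * grad                         # small step per batch
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 2))
y = X @ np.array([3.0, -2.0])                      # noise-free targets
w = sgd_minibatch(X, y)
```

Because the order is reshuffled each epoch, no long run of similar examples (such as a series of right turns) dominates consecutive updates.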

A. K. Dewdney, a former Scientific American columnist, 
wrote in 1997, “Although neural nets do solve a few toy 
problems, their powers of computation are so limited that 
I am surprised anyone takes them seriously as a general 
problem-solving tool.” (Dewdney, p. 82) 


61.9.2 Hardware issues 


To implement large and effective software neural
networks, considerable processing and storage resources
need to be committed. While the brain has hardware
tailored to the task of processing signals through a graph
of neurons, simulating even a highly simplified form on
von Neumann technology may compel a neural network
designer to fill many millions of database rows for its
connections, which can consume vast amounts of computer
memory and hard disk space. Furthermore, the designer
of neural network systems will often need to simulate
the transmission of signals through many of these
connections and their associated neurons, which must often
be matched with very large amounts of CPU processing
power and time. While neural networks often yield
effective programs, they too often do so at the cost of
efficiency (they tend to consume considerable amounts of
time and money).

Computing power continues to grow roughly according
to Moore’s Law, which may provide sufficient resources
to accomplish new tasks. Neuromorphic engineering
addresses the hardware difficulty directly, by constructing
non-von-Neumann chips with circuits designed to
implement neural nets from the ground up.

61.9.3 Practical counterexamples to criticisms

Arguments against Dewdney’s position are that neural
networks have been successfully used to solve many
complex and diverse tasks, ranging from autonomously flying
aircraft [53] to detecting credit card fraud.

Technology writer Roger Bridgman commented on 
Dewdney’s statements about neural nets: 

Neural networks, for instance, are in the
dock not only because they have been hyped
to high heaven, (what hasn't?) but also
because you could create a successful net
without understanding how it worked: the bunch
of numbers that captures its behaviour would
in all probability be “an opaque, unreadable
table... valueless as a scientific resource”.

In spite of his emphatic declaration that
science is not technology, Dewdney seems here
to pillory neural nets as bad science when most
of those devising them are just trying to be
good engineers. An unreadable table that a
useful machine could read would still be well
worth having.[54]

Although it is true that analyzing what has been learned
by an artificial neural network is difficult, it is much
easier to do so than to analyze what has been learned by a
biological neural network. Furthermore, researchers
involved in exploring learning algorithms for neural
networks are gradually uncovering generic principles which
allow a learning machine to be successful. For example,
Bengio and LeCun (2007) wrote an article regarding
local vs non-local learning, as well as shallow vs deep
architecture.[55]

61.9.4 Hybrid approaches 

Some other criticisms come from advocates of hybrid
models (combining neural networks and symbolic
approaches), who believe that the intermix of these two
approaches can better capture the mechanisms of the human
mind.[56][57]


61.10 Gallery 

• A single-layer feedforward artificial neural network. 
Arrows originating from are omitted for clarity. 
There are p inputs to this network and q outputs. 
In this system, the value of the qth output, would be 
calculated as 

• A two-layer feedforward artificial neural network. 


61.11 See also 

• 20Q 

• ADALINE

• Adaptive resonance theory 

• Artificial life 

• Associative memory 

• Autoencoder 

• Backpropagation 

• BEAM robotics 

• Biological cybernetics 

• Biologically inspired computing 

• Blue brain 

• Catastrophic interference 

• Cerebellar Model Articulation Controller 

• Cognitive architecture 

• Cognitive science 

• Convolutional neural network (CNN) 

• Connectionist expert system 

• Connectomics 

• Cultured neuronal networks 

• Deep learning 

• Digital morphogenesis 

• Encog 

• Fuzzy logic 

• Gene expression programming

• Group method of data handling

• Habituation 

• In Situ Adaptive Tabulation 

• Models of neural computation 

• Multilinear subspace learning 

• Neuroevolution 

• Neural coding

• Neural gas

• Neural network software 

• Neuroscience 

• Ni1000 chip

• Nonlinear system identification 

• Optical neural network 

• Parallel Constraint Satisfaction Processes 

• Parallel distributed processing 

• Radial basis function network 

• Recurrent neural networks 

• Self-organizing map 

• Spiking neural network 

• Systolic array 

• Tensor product network 

• Time delay neural network (TDNN)

Chapter 62

Deep learning 


For deep versus shallow learning in educational
psychology, see Student approaches to learning.

Deep learning (deep machine learning, or deep structured
learning, or hierarchical learning, or sometimes DL) is a
branch of machine learning based on a set of algorithms
that attempt to model high-level abstractions in data by
using model architectures, with complex structures or
otherwise, composed of multiple non-linear
transformations.[1](p198)[2][3][4]

Deep learning is part of a broader family of machine
learning methods based on learning representations of
data. An observation (e.g., an image) can be represented
in many ways such as a vector of intensity values per pixel,
or in a more abstract way as a set of edges, regions of
particular shape, etc. Some representations make it
easier to learn tasks (e.g., face recognition or facial
expression recognition [5]) from examples. One of the
promises of deep learning is replacing handcrafted features
with efficient algorithms for unsupervised or
semi-supervised feature learning and hierarchical feature
extraction.[6]

Research in this area attempts to make better
representations and create models to learn these
representations from large-scale unlabeled data. Some of the
representations are inspired by advances in neuroscience and
are loosely based on interpretation of information processing
and communication patterns in a nervous system, such
as neural coding which attempts to define a relationship
between the stimulus and the neuronal responses and the
relationship among the electrical activity of the neurons
in the brain.[7]

Various deep learning architectures such as deep neural
networks, convolutional deep neural networks, deep
belief networks and recurrent neural networks have been
applied to fields like computer vision, automatic speech
recognition, natural language processing, audio
recognition and bioinformatics where they have been shown to
produce state-of-the-art results on various tasks.

Alternatively, deep learning has been characterized as a
buzzword, or a rebranding of neural networks.[8][9]


62.1 Introduction 

62.1.1 Definitions 

There are a number of ways that the field of deep
learning has been characterized. Deep learning is a class of
machine learning algorithms that[1](pp199-200)

• use a cascade of many layers of nonlinear
processing units for feature extraction and
transformation. Each successive layer uses the output from
the previous layer as input. The algorithms may
be supervised or unsupervised and applications
include pattern analysis (unsupervised) and
classification (supervised).

• are based on the (unsupervised) learning of
multiple levels of features or representations of the data.
Higher level features are derived from lower level
features to form a hierarchical representation.

• are part of the broader machine learning field of
learning representations of data.

• learn multiple levels of representations that
correspond to different levels of abstraction; the levels
form a hierarchy of concepts.

These definitions have in common (1) multiple layers
of nonlinear processing units and (2) the supervised or
unsupervised learning of feature representations in each
layer, with the layers forming a hierarchy from low-level
to high-level features.[1](p200) The composition of a layer
of nonlinear processing units used in a deep learning
algorithm depends on the problem to be solved. Layers
that have been used in deep learning include hidden
layers of an artificial neural network and sets of complicated
propositional formulas.[2] They may also include latent
variables organized layer-wise in deep generative
models such as the nodes in Deep Belief Networks and Deep
Boltzmann Machines.

Deep learning algorithms are contrasted with shallow
learning algorithms by the number of parameterized
transformations a signal encounters as it propagates from
the input layer to the output layer, where a parameterized
transformation is a processing unit that has trainable
parameters, such as weights and thresholds.[4](p6) A chain
of transformations from input to output is a credit
assignment path (CAP). CAPs describe potentially causal
connections between input and output and may vary in length.
For a feedforward neural network, the depth of the CAPs,
and thus the depth of the network, is the number of
hidden layers plus one (the output layer is also
parameterized). For recurrent neural networks, in which a
signal may propagate through a layer more than once, the CAP
is potentially unlimited in length. There is no universally
agreed upon threshold of depth dividing shallow learning
from deep learning, but most researchers in the field agree
that deep learning has multiple nonlinear layers (CAP >
2) and Schmidhuber considers CAP > 10 to be very deep
learning.[4](p7)
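
The depth count for a feedforward network can be illustrated with a short Python sketch. The layer sizes, the tanh nonlinearity and the helper names are assumptions chosen for the example; the point is only that each (weights, bias) pair is one parameterized transformation on the path from input to output.

```python
import numpy as np

def make_network(n_inputs, hidden_sizes, n_outputs, seed=0):
    """Random parameters for a feedforward net; the sizes are illustrative."""
    rng = np.random.default_rng(seed)
    sizes = [n_inputs, *hidden_sizes, n_outputs]
    return [(rng.normal(size=(n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x, layers):
    """Pass x through the cascade; each layer consumes the previous output."""
    for weights, bias in layers:
        x = np.tanh(weights @ x + bias)
    return x

layers = make_network(n_inputs=4, hidden_sizes=[8, 8, 8], n_outputs=2)
cap_depth = len(layers)      # 3 hidden layers + 1 output layer = 4
output = forward(np.ones(4), layers)
```

With three hidden layers the credit assignment path has depth 4, which already exceeds the CAP > 2 threshold mentioned above.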


62.1.2 Fundamental concepts 

Deep learning algorithms are based on distributed
representations. The underlying assumption behind
distributed representations is that observed data is generated
by the interactions of many different factors on different
levels. Deep learning adds the assumption that these
factors are organized into multiple levels, corresponding to
different levels of abstraction or composition. Varying
numbers of layers and layer sizes can be used to provide
different amounts of abstraction.[3]

Deep learning algorithms in particular exploit this idea 
of hierarchical explanatory factors. Different concepts 
are learned from other concepts, with the more abstract, 
higher level concepts being learned from the lower level 
ones. These architectures are often constructed with a 
greedy layer-by-layer method that models this idea. Deep 
learning helps to disentangle these abstractions and pick 
out which features are useful for learning.[3]

For supervised learning tasks where label information is
readily available in training, deep learning promotes a
principle which is very different from traditional
methods of machine learning. That is, rather than focusing
on feature engineering, which is often labor-intensive and
varies from one task to another, deep learning methods
are focused on end-to-end learning based on raw features.
In other words, deep learning moves away from feature
engineering to the maximal extent possible. To accomplish
end-to-end optimization starting with raw features and
ending in labels, layered structures are often necessary.
From this perspective, we can regard the use of layered
structures to derive intermediate representations in deep
learning as a natural consequence of raw-feature-based
end-to-end learning.[1] Understanding the connection
between the above two aspects of deep learning is important
to appreciate its use in several application areas, all
involving supervised learning tasks (e.g., supervised speech
and image recognition), as discussed in a later part
of this article.


Many deep learning algorithms are framed as
unsupervised learning problems. Because of this, these
algorithms can make use of the unlabeled data that supervised
algorithms cannot. Unlabeled data is usually more
abundant than labeled data, making this an important benefit
of these algorithms. The deep belief network is an
example of a deep structure that can be trained in an
unsupervised manner.[3]


62.2 History 

Deep learning architectures, specifically those built from
artificial neural networks (ANN), date back at least to
the Neocognitron introduced by Kunihiko Fukushima in
1980.[10] The ANNs themselves date back even further.
In 1989, Yann LeCun et al. were able to apply the
standard backpropagation algorithm, which had been around
since 1974,[11] to a deep neural network with the purpose
of recognizing handwritten ZIP codes on mail. Despite
the success of applying the algorithm, the time to train
the network on this dataset was approximately 3 days,
making it impractical for general use.[12] Many factors
contribute to the slow speed, one being the so-called
vanishing gradient problem analyzed in 1991 by
Sepp Hochreiter.[13][14]

While such neural networks by 1991 were used for
recognizing isolated 2-D hand-written digits, 3-D object
recognition by 1991 used a 3-D model-based approach,
matching 2-D images with a handcrafted 3-D object
model. Juyang Weng et al. proposed that a human brain
does not use a monolithic 3-D object model, and in 1992
they published Cresceptron,[15][16][17] a method for
performing 3-D object recognition directly from cluttered
scenes. Cresceptron is a cascade of many layers similar to
Neocognitron. But unlike Neocognitron, which required
the human programmer to hand-merge features,
Cresceptron fully automatically learned an open number of
unsupervised features in each layer of the cascade, where
each feature is represented by a convolution kernel. In
addition, Cresceptron also segmented each learned
object from a cluttered scene through back-analysis through
the network. Max-pooling, now often adopted by deep
neural networks (e.g., ImageNet tests), was first used in
Cresceptron to reduce the position resolution by a factor
of (2x2) to 1 through the cascade for better
generalization. Because of a great lack of understanding of
how the brain autonomously wires its biological networks,
and because of the computational cost of ANNs at the time,
simpler models that use task-specific handcrafted features,
such as Gabor filters and support vector machines (SVMs),
were a popular choice in the field in the 1990s and 2000s.

In the long history of speech recognition, both shallow
and deep forms (e.g., recurrent nets) of artificial
neural networks had been explored for many years.[18][19][20]
But these methods never won over the non-uniform
internal-handcrafting Gaussian mixture model/Hidden
Markov model (GMM-HMM) technology based on
generative models of speech trained discriminatively.[21]
A number of key difficulties had been methodologically
analyzed, including gradient diminishing and weak
temporal correlation structure in the neural predictive
models.[22][23] All these difficulties were in addition to
the lack of big training data and big computing power
in these early days. Most speech recognition researchers
who understood such barriers hence subsequently moved
away from neural nets to pursue generative modeling
approaches until the recent resurgence of deep learning that
has overcome all these difficulties. Hinton et al. and
Deng et al. reviewed part of this recent history about
how their collaboration with each other and then with
cross-group colleagues ignited the renaissance of neural
networks and initiated deep learning research and
applications in speech recognition.[24][25][26][27]

The term “deep learning” gained traction in the
mid-2000s after a publication by Geoffrey Hinton and Ruslan
Salakhutdinov showed how a many-layered feedforward
neural network could be effectively pre-trained one layer
at a time, treating each layer in turn as an unsupervised
restricted Boltzmann machine, then using supervised
backpropagation for fine-tuning.[28] In 1992,
Schmidhuber had already implemented a very similar idea for
the more general case of unsupervised deep hierarchies of
recurrent neural networks, and also experimentally shown
its benefits for speeding up supervised learning.[29][30]

Since the resurgence of deep learning, it has become part
of many state-of-the-art systems in different disciplines,
particularly computer vision and automatic speech
recognition (ASR). Results on commonly used
evaluation sets such as TIMIT (ASR) and MNIST (image
classification) as well as a range of large vocabulary speech
recognition tasks are constantly being improved with new
applications of deep learning.[24][31][32] Currently, it has
been shown that deep learning architectures in the form
of convolutional neural networks have been nearly best
performing;[33][34] however, these are more widely used
in computer vision than in ASR.

The real impact of deep learning in industry started in
large-scale speech recognition around 2010. In late 2009,
Geoff Hinton was invited by Li Deng to work with him
and colleagues at Microsoft Research in Redmond to
apply deep learning to speech recognition. They
co-organized the 2009 NIPS Workshop on Deep Learning
for Speech Recognition. The workshop was motivated
by the limitations of deep generative models of speech,
and the possibility that the big-compute, big-data era
warranted a serious try of the deep neural net (DNN)
approach. It was then (incorrectly) believed that
pre-training of DNNs using generative models of the deep
belief net (DBN) would be the cure for the main difficulties
of neural nets encountered during the 1990s.[26] However,
soon after the research along this direction started at
Microsoft Research, it was discovered that when large
amounts of training data are used, and especially when DNNs
are designed correspondingly with large, context-dependent
output layers, dramatic error reduction occurred over the
then-state-of-the-art GMM-HMM and more advanced
generative model-based speech recognition systems
without the need for generative DBN pre-training, a
finding verified subsequently by several other major speech
recognition research groups.[24][35] Further, the nature of
recognition errors produced by the two types of systems
was found to be characteristically different,[25][36]
offering technical insights into how to artfully integrate
deep learning into the existing highly efficient, run-time
speech decoding system deployed by all major players in the
speech recognition industry. The history of this significant
development in deep learning has been described and
analyzed in recent books.[1][37]

Advances in hardware have also been an important
enabling factor for the renewed interest in deep learning.
In particular, powerful graphics processing units (GPUs)
are highly suited for the kind of number crunching and
matrix/vector math involved in machine learning. GPUs
have been shown to speed up training algorithms by
orders of magnitude, bringing running times of weeks back
to days.[38][39]

62.3 Deep learning in artificial neural networks

Some of the most successful deep learning methods
involve artificial neural networks. Artificial neural
networks are inspired by the 1959 biological model proposed
by Nobel laureates David H. Hubel & Torsten Wiesel,
who found two types of cells in the primary visual cortex:
simple cells and complex cells. Many artificial neural
networks can be viewed as cascading models [15][16][17][40] of
cell types inspired by these biological observations.

Fukushima’s Neocognitron introduced convolutional
neural networks partially trained by unsupervised
learning while humans directed features in the neural
plane. Yann LeCun et al. (1989) applied supervised
backpropagation to such architectures.[41] Weng
et al. (1992) published the convolutional neural network
Cresceptron [15][16][17] for 3-D object recognition from
images of cluttered scenes and segmentation of such
objects from images.

An obvious need for recognizing general 3-D
objects is at least shift invariance and tolerance to
deformation. Max-pooling appeared to be first proposed
by Cresceptron [15][16] to enable the network to tolerate
small-to-large deformation in a hierarchical way while
using convolution. Max-pooling helps, but still does not
fully guarantee, shift-invariance at the pixel level.[17]
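
The following Python sketch shows non-overlapping 2x2 max-pooling on a toy image and why it helps with, but does not guarantee, shift invariance. It is a generic illustration, not Cresceptron's implementation; the function name and the 4x4 test image are assumptions.

```python
import numpy as np

def max_pool_2x2(image):
    """Keep the maximum of each non-overlapping 2x2 block of a 2-D array."""
    h, w = image.shape
    h, w = h - h % 2, w - w % 2                 # drop an odd trailing row/column
    blocks = image[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

image = np.zeros((4, 4))
image[1, 1] = 1.0                               # a single bright pixel
shifted_left = np.roll(image, -1, axis=1)       # pixel moves to column 0
shifted_right = np.roll(image, 1, axis=1)       # pixel moves to column 2

same = np.array_equal(max_pool_2x2(image), max_pool_2x2(shifted_left))
differs = not np.array_equal(max_pool_2x2(image), max_pool_2x2(shifted_right))
```

A one-pixel shift that stays inside the same 2x2 block leaves the pooled output unchanged, while a one-pixel shift across a block boundary changes it.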

With the advent of the back-propagation algorithm in the
1970s, many researchers tried to train supervised deep
artificial neural networks from scratch, initially with little
success. Sepp Hochreiter's diploma thesis of 1991 [42][43]
formally identified the reason for this failure in the
“vanishing gradient problem,” which not only affects
many-layered feedforward networks, but also recurrent neural
networks. The latter are trained by unfolding them into
very deep feedforward networks, where a new layer is
created for each time step of an input sequence processed by
the network. As errors propagate from layer to layer, they
shrink exponentially with the number of layers.
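
The shrinkage can be observed numerically with a small Python sketch. It assumes sigmoid units, square weight matrices with small random entries and an all-ones error signal at the output; these choices are illustrative and are not taken from Hochreiter's analysis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backpropagated_error_norms(n_layers, width=10, seed=0):
    """Norm of an error signal after passing back through each sigmoid layer."""
    rng = np.random.default_rng(seed)
    # Forward pass, remembering each layer's weights and activations.
    a = rng.normal(size=width)
    weights, activations = [], []
    for _ in range(n_layers):
        W = rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
        a = sigmoid(W @ a)
        weights.append(W)
        activations.append(a)
    # Backward pass: delta_prev = W^T (delta * sigmoid'(z)), sigmoid' = a(1 - a).
    delta = np.ones(width)
    norms = []
    for W, a in zip(reversed(weights), reversed(activations)):
        delta = W.T @ (delta * a * (1.0 - a))
        norms.append(float(np.linalg.norm(delta)))
    return norms

norms = backpropagated_error_norms(n_layers=20)
```

Since the sigmoid derivative never exceeds 0.25, each layer scales the error down unless the weights are large, and `norms` falls by several orders of magnitude over the 20 layers.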

To overcome this problem, several methods were
proposed. One is Jürgen Schmidhuber's multi-level
hierarchy of networks (1992) pre-trained one level at a
time through unsupervised learning, fine-tuned through
backpropagation.[29] Here each level learns a compressed
representation of the observations that is fed to the next
level.

Another method is the long short-term memory (LSTM)
network of 1997 by Hochreiter & Schmidhuber.[44] In
2009, deep multidimensional LSTM networks won three
ICDAR 2009 competitions in connected handwriting
recognition, without any prior knowledge about the three
different languages to be learned.[45][46]

Sven Behnke relied only on the sign of the
gradient (Rprop) when training his Neural Abstraction
Pyramid [47] to solve problems like image reconstruction
and face localization.
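
A sign-based update of this kind can be sketched in Python as below. This is a simplified Rprop variant written for illustration, not Behnke's training code; the growth and shrink factors, step bounds and the quadratic test function are assumptions.

```python
import numpy as np

def rprop_minimise(grad_fn, w, n_steps=200, step0=0.1,
                   grow=1.2, shrink=0.5, step_min=1e-6, step_max=1.0):
    """Simplified Rprop: per-weight step sizes adapted from gradient signs only."""
    w = np.array(w, dtype=float)
    step = np.full_like(w, step0)
    prev_grad = np.zeros_like(w)
    for _ in range(n_steps):
        grad = grad_fn(w)
        agreement = grad * prev_grad
        # Same sign as last time: speed up. Sign flipped: we overshot, slow down.
        step = np.where(agreement > 0, np.minimum(step * grow, step_max), step)
        step = np.where(agreement < 0, np.maximum(step * shrink, step_min), step)
        w -= np.sign(grad) * step        # the gradient magnitude is never used
        prev_grad = grad
    return w

target = np.array([3.0, -1.5])
w = rprop_minimise(lambda v: 2.0 * (v - target), w=[0.0, 0.0])
```

Because only the sign is used, very small gradients (as produced by the vanishing gradient problem) still lead to weight changes of a useful size.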

Other methods also use unsupervised pre-training to
structure a neural network, making it first learn generally
useful feature detectors. Then the network is trained
further by supervised back-propagation to classify labeled
data. The deep model of Hinton et al. (2006) involves
learning the distribution of a high-level representation
using successive layers of binary or real-valued latent
variables. It uses a restricted Boltzmann machine
(Smolensky, 1986[48]) to model each new layer of
higher-level features. Each new layer guarantees an increase
in the lower bound of the log likelihood of the data, thus
improving the model, if trained properly. Once sufficiently
many layers have been learned, the deep architecture may
be used as a generative model by reproducing the data
when sampling down the model (an “ancestral pass”)
from the top-level feature activations.[49] Hinton reports
that his models are effective feature extractors over
high-dimensional, structured data.[50]

The Google Brain team led by Andrew Ng and Jeff Dean
created a neural network that learned to recognize
higher-level concepts, such as cats, only from watching
unlabeled images taken from YouTube videos.[51][52]

Other methods rely on the sheer processing power of
modern computers, in particular GPUs. In 2010 it
was shown by Dan Ciresan and colleagues[38] in Jürgen
Schmidhuber's group at the Swiss AI Lab IDSIA that
despite the above-mentioned “vanishing gradient
problem,” the superior processing power of GPUs makes plain
back-propagation feasible for deep feedforward neural
networks with many layers. The method outperformed
all other machine learning techniques on the old, famous
MNIST handwritten digits problem of Yann LeCun and
colleagues at NYU.

At about the same time, in late 2009, deep learning made
inroads into speech recognition, as marked by the NIPS
Workshop on Deep Learning for Speech Recognition.
Intensive collaborative work between Microsoft Research
and University of Toronto researchers had demonstrated
by mid-2010 in Redmond that deep neural networks
interfaced with a hidden Markov model, with
context-dependent states that define the neural network output
layer, can drastically reduce errors in large-vocabulary
speech recognition tasks such as voice search. The same
deep neural net model was shown to scale up to
Switchboard tasks about one year later at Microsoft Research
Asia.

As of 2011, the state of the art in deep learning
feedforward networks alternates convolutional layers
and max-pooling layers,[53][54] topped by several
pure classification layers. Training is usually done
without any unsupervised pre-training. Since 2011,
GPU-based implementations[53] of this approach won
many pattern recognition contests, including the IJCNN
2011 Traffic Sign Recognition Competition,[55] the ISBI
2012 Segmentation of neuronal structures in EM stacks
challenge,[56] and others.

Such supervised deep learning methods also were the
first artificial pattern recognizers to achieve
human-competitive performance on certain tasks.[57]

To break the barriers of weak AI represented by deep
learning, it is necessary to go beyond the deep learning
architectures because biological brains use both shallow
and deep circuits, as reported by brain anatomy,[58]
in order to deal with the wide variety of invariance that
the brain displays. Weng[59] argued that the brain
self-wires largely according to signal statistics and, therefore,
a serial cascade cannot catch all major statistical
dependencies. Fully guaranteed shift invariance for ANNs to
deal with small and large natural objects in large
cluttered scenes was achieved when the invariance was
extended beyond shift to all ANN-learned concepts, such
as location, type (object class label), scale and lighting, in
the Developmental Networks (DNs),[60] whose embodiments
are Where-What Networks, WWN-1 (2008)[61] through
WWN-7 (2013).[62]


62.4 Deep learning architectures 

There is a huge number of different variants of deep
architectures; however, most of them are branched from
some original parent architectures. It is not always
possible to compare the performance of multiple
architectures all together, since they are not all evaluated on
the same data set. Deep learning is a fast-growing field,
so new architectures, variants, or algorithms may appear
every few weeks.

62.4.1 Deep neural networks 

A deep neural network (DNN) is an artificial neural
network with multiple hidden layers of units between
the input and output layers.[2][4] Similar to shallow
ANNs, DNNs can model complex non-linear
relationships. DNN architectures, e.g., for object detection
and parsing, generate compositional models where
the object is expressed as a layered composition of image
primitives.[63] The extra layers enable composition of
features from lower layers, giving the potential of modeling
complex data with fewer units than a similarly performing
shallow network.[2]

DNNs are typically designed as feedforward networks,
but recent research has successfully applied the deep
learning architecture to recurrent neural networks for
applications such as language modeling.[64] Convolutional
deep neural networks (CNNs) are used in computer
vision where their success is well-documented.[65] More
recently, CNNs have been applied to acoustic modeling
for automatic speech recognition (ASR), where they have
shown success over previous models.[34] For simplicity, a
look at training DNNs is given here.

A DNN can be discriminatively trained with the standard
backpropagation algorithm. The weight updates can be
done via stochastic gradient descent using the following
equation:

$w_{ij}(t+1) = w_{ij}(t) - \eta \frac{\partial C}{\partial w_{ij}}$

Here, $\eta$ is the learning rate, and $C$ is the cost
function. The choice of the cost function depends on
factors such as the learning type (supervised,
unsupervised, reinforcement, etc.) and the activation function.
For example, when performing supervised learning on
a multiclass classification problem, common choices for
the activation function and cost function are the softmax
function and cross entropy function, respectively. The
softmax function is defined as
$p_j = \frac{\exp(x_j)}{\sum_k \exp(x_k)}$ where $p_j$
represents the class probability and $x_j$ and $x_k$ represent
the total input to units $j$ and $k$ respectively. Cross
entropy is defined as $C = -\sum_j d_j \log(p_j)$ where $d_j$
represents the target probability for output unit $j$ and
$p_j$ is the probability output for $j$ after applying the
activation function.[66]
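
The pieces above can be put together in a short example. The
following is a minimal sketch, assuming a single softmax layer
and one training example; the step is taken against the gradient
of the cost, and all names and numeric values are hypothetical.

```python
import math


def softmax(x):
    """p_j = exp(x_j) / sum_k exp(x_k), shifted by max(x) for stability."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(d, p):
    """C = -sum_j d_j * log(p_j); terms with d_j = 0 contribute nothing."""
    return -sum(dj * math.log(pj) for dj, pj in zip(d, p) if dj > 0)


def sgd_step(w, grad, eta):
    """w_ij(t+1) = w_ij(t) - eta * dC/dw_ij, applied element-wise."""
    return [[wij - eta * gij for wij, gij in zip(w_row, g_row)]
            for w_row, g_row in zip(w, grad)]


# One training example for a single softmax layer.
inp = [1.0, 2.0]
target = [0.0, 1.0, 0.0]                    # d: one-hot target
w = [[0.1, -0.2, 0.05], [0.0, 0.3, -0.1]]   # w[i][j]


def forward(weights):
    """Total input x_j = sum_i inp_i * w_ij, followed by softmax."""
    x = [sum(inp[i] * weights[i][j] for i in range(len(inp)))
         for j in range(len(target))]
    return softmax(x)


p = forward(w)
# For softmax with cross entropy, dC/dx_j = p_j - d_j,
# hence dC/dw_ij = inp_i * (p_j - d_j).
grad = [[inp[i] * (p[j] - target[j]) for j in range(len(target))]
        for i in range(len(inp))]
loss_before = cross_entropy(target, p)
w = sgd_step(w, grad, eta=0.1)
loss_after = cross_entropy(target, forward(w))
print(loss_before, loss_after)
```

With this small learning rate a single step lowers the cost on
the example it was computed from.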


62.4.2 Issues with deep neural networks 

As with ANNs, many issues can arise with DNNs if they 
are naively trained. Two common issues are overfitting 
and computation time. 


DNNs are prone to overfitting because of the added
layers of abstraction, which allow them to model rare
dependencies in the training data. Regularization
methods such as weight decay ($\ell_2$-regularization) or sparsity
($\ell_1$-regularization) can be applied during training to
help combat overfitting.[67] A more recent regularization
method applied to DNNs is dropout regularization. In
dropout, some number of units are randomly omitted
from the hidden layers during training. This helps to
break the rare dependencies that can occur in the training
data.[68]
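
Dropout can be sketched in a few lines. The following is a
minimal illustration, assuming the "inverted dropout" variant in
which surviving units are rescaled during training; that
rescaling, the function name and the values are assumptions, not
taken from the cited work.

```python
import random


def dropout(activations, drop_prob, rng):
    """Randomly omit hidden units during training.

    Each unit is zeroed with probability `drop_prob`. Survivors are
    scaled by 1 / (1 - drop_prob) so that the expected activation
    stays the same (the "inverted dropout" variant, assumed here).
    """
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)
hidden = [0.5, 1.2, 0.3, 0.9, 0.7, 1.1]
print(dropout(hidden, drop_prob=0.5, rng=rng))
```

At test time no units are omitted; with the rescaling above the
activations can then be used unchanged.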

Backpropagation and gradient descent have been the
preferred method for training these structures due to the
ease of implementation and their tendency to converge
to better local optima in comparison with other training
methods. However, these methods can be computationally
expensive, especially when being used to train DNNs.
There are many training parameters to be considered with
a DNN, such as the size (number of layers and number
of units per layer), the learning rate and initial weights.
Sweeping through the parameter space for optimal
parameters may not be feasible due to the cost in time and
computational resources. Various “tricks” such as using
mini-batching (computing the gradient on several
training examples at once rather than individual examples)[69]
have been shown to speed up computation. The large
processing throughput of GPUs has produced significant
speedups in training, because the required matrix and
vector computations are well suited to GPUs.[4] Radical
alternatives to backprop such as Extreme Learning
Machines,[70] “No-prop” networks[71] and Weightless neural
networks[72] are gaining attention.


62.4.3 Deep belief networks 

Main article: Deep belief network 
A deep belief network (DBN) is a probabilistic,
generative model made up of multiple layers of hidden
units. It can be looked at as a composition of simple
learning modules that make up each layer.[73]

A DBN can be used for generatively pre-training a DNN
by using the learned weights as the initial weights.
Backpropagation or other discriminative algorithms can then
be applied for fine-tuning of these weights. This is
particularly helpful in situations where limited training data
is available, as poorly initialized weights can have
significant impact on the performance of the final model. These
pre-trained weights are in a region of the weight space that
is closer to the optimal weights (as compared to just
random initialization). This allows for both improved
modeling capability and faster convergence of the fine-tuning
phase.[74]

A DBN can be efficiently trained in an unsupervised,
layer-by-layer manner where the layers are typically made
of restricted Boltzmann machines (RBM). A description
of training a DBN via RBMs is provided below. An RBM
is an undirected, generative energy-based model with an
input layer and single hidden layer. Connections only
exist between the visible units of the input layer and the
hidden units of the hidden layer; there are no visible-visible
or hidden-hidden connections.

[Figure: A restricted Boltzmann machine (RBM) with fully
connected visible and hidden units. Note there are no
hidden-hidden or visible-visible connections.]

The training method for RBMs was initially proposed
by Geoffrey Hinton for use with training “Product of
Expert” models and is known as contrastive divergence
(CD).[75] CD provides an approximation to the maximum
likelihood method that would ideally be applied for
learning the weights of the RBM.[69][76]

In training a single RBM, weight updates are performed
with gradient ascent via the following equation:
$w_{ij}(t+1) = w_{ij}(t) + \eta \frac{\partial \log(p(v))}{\partial w_{ij}}$.
Here, $p(v)$ is the probability of a visible vector, which is
given by $p(v) = \frac{1}{Z} \sum_h e^{-E(v,h)}$. $Z$ is the
partition function (used for normalizing) and $E(v,h)$ is
the energy function assigned to the state of the network.
A lower energy indicates the network is in a more
“desirable” configuration. The gradient
$\frac{\partial \log(p(v))}{\partial w_{ij}}$ has the simple form
$\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}$,
where $\langle \cdots \rangle_p$ represent averages with respect
to distribution $p$. The issue arises in sampling
$\langle v_i h_j \rangle_{\text{model}}$, as this
requires running alternating Gibbs sampling for a long
time. CD replaces this step by running alternating Gibbs
sampling for $n$ steps (values of $n = 1$ have empirically
been shown to perform well). After $n$ steps, the data is
sampled and that sample is used in place of
$\langle v_i h_j \rangle_{\text{model}}$.
The CD procedure works as follows: 1 2 " 1 

1. Initialize the visible units to a training vector.

2. Update the hidden units in parallel given the visible
units: $p(h_j = 1 \mid \mathbf{V}) = \sigma(b_j + \sum_i v_i w_{ij})$.
$\sigma$ represents the sigmoid function and $b_j$ is the
bias of $h_j$.

3. Update the visible units in parallel given the hidden
units: $p(v_i = 1 \mid \mathbf{H}) = \sigma(a_i + \sum_j h_j w_{ij})$.
$a_i$ is the bias of $v_i$. This is called the
“reconstruction” step.

4. Reupdate the hidden units in parallel given the
reconstructed visible units using the same equation as
in step 2.

5. Perform the weight update:
$\Delta w_{ij} \propto \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{reconstruction}}$.
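
The five steps can be sketched for a single training vector as
follows. This is a minimal illustration, assuming n = 1, binary
units, and hidden probabilities (rather than sampled states) in
the pairwise statistics; bias updates are omitted, and all
function names and values are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_bernoulli(probs, rng):
    """Draw binary states from independent Bernoulli probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]


def hidden_probs(v, w, b):
    """p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i * w_ij)."""
    return [sigmoid(b[j] + sum(v[i] * w[i][j] for i in range(len(v))))
            for j in range(len(b))]


def visible_probs(h, w, a):
    """p(v_i = 1 | h) = sigmoid(a_i + sum_j h_j * w_ij)."""
    return [sigmoid(a[i] + sum(h[j] * w[i][j] for j in range(len(h))))
            for i in range(len(a))]


def cd1_update(v0, w, a, b, eta, rng):
    """One CD-1 weight update for training vector v0 (w changes in place)."""
    # Steps 1-2: clamp the data and sample the hidden units.
    ph0 = hidden_probs(v0, w, b)
    h0 = sample_bernoulli(ph0, rng)
    # Step 3: reconstruct the visible units.
    v1 = sample_bernoulli(visible_probs(h0, w, a), rng)
    # Step 4: re-update the hidden units from the reconstruction.
    ph1 = hidden_probs(v1, w, b)
    # Step 5: w_ij += eta * (<v_i h_j>_data - <v_i h_j>_reconstruction).
    for i in range(len(v0)):
        for j in range(len(b)):
            w[i][j] += eta * (v0[i] * ph0[j] - v1[i] * ph1[j])
    return v1


rng = random.Random(0)
n_visible, n_hidden = 4, 2
w = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
     for _ in range(n_visible)]
a = [0.0] * n_visible
b = [0.0] * n_hidden
for _ in range(100):
    cd1_update([1, 1, 0, 0], w, a, b, eta=0.1, rng=rng)
print(w)
```

In practice the update is usually averaged over a mini-batch of
training vectors rather than applied one vector at a time.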

Once an RBM is trained, another RBM can be “stacked”
atop it to create a multilayer model. Each time
another RBM is stacked, the input visible layer is initialized
to a training vector and values for the units in the
already-trained RBM layers are assigned using the current weights
and biases. The final layer of the already-trained layers is
used as input to the new RBM. The new RBM is then
trained with the procedure above, and then this whole
process can be repeated until some desired stopping
criterion is met.[2]

Despite the approximation of CD to maximum likelihood
being very crude (CD has been shown to not follow the
gradient of any function), empirical results have shown
it to be an effective method for use with training deep
architectures.[69]

62.4.4 Convolutional neural networks 

Main article: Convolutional neural network 

A CNN is composed of one or more convolutional layers
with fully connected layers (matching those in typical
artificial neural networks) on top. It also uses tied weights
and pooling layers. This architecture allows CNNs to
take advantage of the 2D structure of input data. In
comparison with other deep architectures, convolutional
neural networks are starting to show superior results in
both image and speech applications. They can also be
trained with standard backpropagation. CNNs are
easier to train than other regular, deep, feed-forward neural
networks and have many fewer parameters to estimate,
making them a highly attractive architecture to use.[77]

62.4.5 Convolutional Deep Belief Networks

A recent achievement in deep learning is from the use of
convolutional deep belief networks (CDBN). A CDBN
is very similar to a normal convolutional neural network
in terms of its structure. Therefore, like CNNs, CDBNs are
also able to exploit the 2D structure of images, combined
with the advantage gained by pre-training in deep belief
networks. They provide a generic structure which can be
used in many image and signal processing tasks and can
be trained in a way similar to that for deep belief
networks. Recently, many benchmark results on standard
image datasets like CIFAR[78] have been obtained using
CDBNs.[79]


62.4.6 Deep Boltzmann Machines 


A Deep Boltzmann Machine (DBM) is a type of binary
pairwise Markov random field (undirected probabilistic
graphical model) with multiple layers of hidden
random variables. It is a network of symmetrically
coupled stochastic binary units. It comprises a set of visible
units $\mathbf{v} \in \{0,1\}^D$, and a series of layers of
hidden units $\mathbf{h}^{(1)} \in \{0,1\}^{F_1},
\mathbf{h}^{(2)} \in \{0,1\}^{F_2}, \ldots,
\mathbf{h}^{(L)} \in \{0,1\}^{F_L}$. There is no connection
between the units of the same layer (like RBM). For the
DBM, we can write the probability which is assigned to
vector $\mathbf{v}$ as:

$p(\mathbf{v}) = \frac{1}{Z} \sum_h \exp\left( \sum_{ij} W_{ij}^{(1)} v_i h_j^{(1)} + \sum_{jl} W_{jl}^{(2)} h_j^{(1)} h_l^{(2)} + \sum_{lm} W_{lm}^{(3)} h_l^{(2)} h_m^{(3)} \right)$

where $\mathbf{h} = \{\mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{h}^{(3)}\}$
are the set of hidden units, and
$\theta = \{W^{(1)}, W^{(2)}, W^{(3)}\}$ are the model
parameters, representing visible-hidden and hidden-hidden
symmetric interaction, since they are undirected links. As is
clear, by setting $W^{(2)} = 0$ and $W^{(3)} = 0$ the
network becomes the well-known restricted Boltzmann
machine.[80]
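
For very small models the probability above can be evaluated by
brute force. The following is a minimal sketch, assuming three
hidden layers and no bias terms, with the partition function Z
computed by exhaustive enumeration; the function names and
weight values are hypothetical.

```python
import itertools
import math


def dbm_exponent(v, h1, h2, h3, w1, w2, w3):
    """sum_ij W1_ij v_i h1_j + sum_jl W2_jl h1_j h2_l + sum_lm W3_lm h2_l h3_m."""
    def bilinear(x, w, y):
        return sum(x[i] * w[i][j] * y[j]
                   for i in range(len(x)) for j in range(len(y)))
    return bilinear(v, w1, h1) + bilinear(h1, w2, h2) + bilinear(h2, w3, h3)


def binary_vectors(n):
    """All 2**n binary vectors of length n."""
    return list(itertools.product((0, 1), repeat=n))


def dbm_probability(v, w1, w2, w3):
    """p(v) = (1/Z) * sum_h exp(exponent), by exhaustive enumeration.

    Only feasible for tiny models: the number of terms grows
    exponentially with the number of units.
    """
    d, f1, f2, f3 = len(w1), len(w2), len(w3), len(w3[0])

    def unnormalized(vis):
        return sum(math.exp(dbm_exponent(vis, h1, h2, h3, w1, w2, w3))
                   for h1 in binary_vectors(f1)
                   for h2 in binary_vectors(f2)
                   for h3 in binary_vectors(f3))

    z = sum(unnormalized(vis) for vis in binary_vectors(d))
    return unnormalized(v) / z


w1 = [[0.5, -0.3], [0.2, 0.1]]   # D = 2 visible, F1 = 2 hidden
w2 = [[0.4], [-0.2]]             # F1 = 2, F2 = 1
w3 = [[0.3]]                     # F2 = 1, F3 = 1
total = sum(dbm_probability(v, w1, w2, w3) for v in binary_vectors(2))
print(total)
```

The exponential cost of this enumeration is the reason exact
inference is replaced by approximations in practice.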


There are several reasons which motivate us to take
advantage of deep Boltzmann machine architectures. Like
DBNs, they benefit from the ability of learning
complex and abstract internal representations of the input in
tasks such as object or speech recognition, with the use
of a limited number of labeled data to fine-tune the
representations built based on a large supply of unlabeled
sensory input data. However, unlike DBNs and deep
convolutional neural networks, they adopt the inference
and training procedure in both directions, bottom-up and
top-down pass, which enables the DBMs to better unveil
the representations of the ambiguous and complex input
structures.[81][82]

Since exact maximum likelihood learning is intractable
for DBMs, we may perform approximate maximum
likelihood learning. Another possibility is to use
mean-field inference to estimate data-dependent
expectations, in combination with a Markov chain Monte
Carlo (MCMC) based stochastic approximation technique
to approximate the expected sufficient statistics of the
model.[80]


The difference between DBNs and DBMs is structural. In
DBNs, the top two layers form a restricted Boltzmann
machine, which is an undirected graphical model, but the
lower layers form a directed generative model. In a DBM,
all layers are connected by undirected links.


Apart from all the advantages of DBMs discussed so far,
they have a crucial disadvantage which limits the
performance and functionality of this kind of architecture.
The approximate inference, which is based on the
mean-field method, is about 25 to 50 times slower than a
single bottom-up pass in DBNs. This time-consuming task
makes the joint optimization quite impractical for large
data sets, and seriously restricts the use of DBMs in tasks
such as feature representations (the mean-field inference
has to be performed for each new test input).[83]


62.4.7 Stacked (Denoising) Auto-Encoders

The auto encoder idea is motivated by the concept of good
representation. For instance, for the case of a classifier it is
possible to define that a good representation is one that will
yield a better performing classifier.

An encoder is a deterministic mapping $f_\theta$ that
transforms an input vector $x$ into a hidden representation
$y$, where $\theta = \{W, b\}$, $W$ is the weight matrix and
$b$ is an offset vector (bias). Conversely, a decoder maps
the hidden representation $y$ back to the reconstructed
input $z$ via $g_\theta$. The whole process of auto encoding
is to compare this reconstructed input to the original and try
to minimize this error to make the reconstructed value as
close as possible to the original.

In stacked denoising auto encoders, the partially corrupted
output is cleaned (denoised). This idea was introduced
in [84] with a specific approach to good representation:
a good representation is one that can be obtained
robustly from a corrupted input and that will be useful for
recovering the corresponding clean input. Implicit in this
definition are the following ideas:

• The higher level representations are relatively stable
and robust to the corruption of the input;

• It is required to extract features that are useful for
representation of the input distribution.

The algorithm consists of multiple steps. It starts with a
stochastic mapping of $x$ to $\tilde{x}$ through
$q_D(\tilde{x} \mid x)$; this is the corrupting step. Then the
corrupted input $\tilde{x}$ passes through a basic auto
encoder process and is mapped to a hidden representation
$y = f_\theta(\tilde{x}) = s(W\tilde{x} + b)$. From
this hidden representation we can reconstruct
$z = g_\theta(y)$. In the last stage a minimization algorithm
is run in order to have $z$ as close as possible to the
uncorrupted input $x$. The reconstruction error $L_H(x, z)$
might be either the cross-entropy loss with an
affine-sigmoid decoder, or the squared error loss with an
affine decoder.[84]

In order to make a deep architecture, auto encoders are
stacked one on top of another. Once the encoding function
$f_\theta$ of the first denoising auto encoder is learned and
used to uncorrupt the input (corrupted input), we can train
the second level.[84]

Once the stacked auto encoder is trained, its output might
be used as the input to a supervised learning algorithm
such as a support vector machine classifier or a multiclass
logistic regression.[84]
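
One forward pass of a single denoising auto encoder can be
sketched as follows. This is a minimal illustration, assuming
masking noise as the corruption process, an affine-sigmoid
decoder with tied weights, and no training step; these choices
and all names are assumptions rather than the method of [84].

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def corrupt(x, drop_prob, rng):
    """q_D(x_tilde | x): masking noise, each input zeroed with prob drop_prob."""
    return [0.0 if rng.random() < drop_prob else xi for xi in x]


def affine_sigmoid(w, b, x):
    """s(W x + b), with W given as a list of rows."""
    return [sigmoid(bi + sum(wij * xj for wij, xj in zip(row, x)))
            for row, bi in zip(w, b)]


def cross_entropy(x, z):
    """L_H(x, z) for binary inputs x and reconstructions z in (0, 1)."""
    return -sum(xi * math.log(zi) + (1 - xi) * math.log(1 - zi)
                for xi, zi in zip(x, z))


rng = random.Random(0)
x = [1.0, 0.0, 1.0, 1.0]
w = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)]  # encoder W
b = [0.0, 0.0]
w_dec = [list(col) for col in zip(*w)]  # tied decoder weights W^T (assumed)
b_dec = [0.0] * 4

x_tilde = corrupt(x, drop_prob=0.25, rng=rng)  # corrupting step
y = affine_sigmoid(w, b, x_tilde)              # y = f(x_tilde)
z = affine_sigmoid(w_dec, b_dec, y)            # z = g(y)
loss = cross_entropy(x, z)                     # compared with the clean x
print(x_tilde, loss)
```

Note that the loss compares the reconstruction with the clean
input x, not with the corrupted input the encoder actually saw.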


62.4.8 Deep Stacking Networks 

One of the deep architectures recently introduced in [85],
which is based on building hierarchies with blocks of
simplified neural network modules, is called the deep
convex network. These networks are called “convex” because
of the formulation of the weights learning problem, which is
a convex optimization problem with a closed-form
solution. The network is also called the deep stacking network
(DSN),[86] emphasizing the fact that a mechanism similar
to stacked generalization is used.[87]

The DSN blocks, each consisting of a simple,
easy-to-learn module, are stacked to form the overall deep
network. It can be trained block-wise in a supervised
fashion without the need for back-propagation for the entire
blocks.[88]

As designed in [85], each block consists of a simplified
MLP with a single hidden layer. It comprises a weight
matrix $U$ as the connection between the logistic sigmoidal
units of the hidden layer $h$ to the linear output layer $y$,
and a weight matrix $W$ which connects each input of the
blocks to their respective hidden layers. Assume
that the target vectors $t$ are arranged to form the columns
of $T$ (the target matrix), let the input data vectors $x$ be
arranged to form the columns of $X$, let
$H = \sigma(W^T X)$ denote the matrix of hidden units, and
assume the lower-layer weights $W$ are known (training
layer-by-layer). The function $\sigma$ performs the
element-wise logistic sigmoid operation. Then learning the
upper-layer weight matrix $U$ given other weights in the
network can be formulated as a convex optimization problem:

$\min_{U^T} f = \| U^T H - T \|_F^2,$

which has a closed-form solution. The input to the first
block $X$ only contains the original data; however, in the
upper blocks, in addition to this original (raw) data there
is a copy of the lower-block(s) output $y$.
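
The closed-form solution is not spelled out in the text. Setting
the gradient of f to zero gives the normal equations
$H H^T U = H T^T$, so $U = (H H^T)^{-1} H T^T$ whenever
$H H^T$ is invertible. The following is a minimal sketch of that
standard least-squares solution, not necessarily the procedure
used in [85]; all names and values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def solve(a, b):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc
                          for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_upper_weights(w, x, t):
    """Return U minimizing ||U^T H - T||_F^2, together with H = sigmoid(W^T X)."""
    h = [[sigmoid(v) for v in row] for row in matmul(transpose(w), x)]
    u = solve(matmul(h, transpose(h)), matmul(h, transpose(t)))
    return u, h


# Columns of X are input vectors; columns of T are target vectors.
x = [[0.0, 0.0, 1.0, 1.0],
     [0.0, 1.0, 0.0, 1.0]]
t = [[0.0, 1.0, 1.0, 0.0]]
w = [[2.0, -1.0],
     [-1.0, 2.0]]           # lower-layer weights, assumed known
u, h = fit_upper_weights(w, x, t)
print(matmul(transpose(u), h))   # block output y = U^T H
```

Because only a linear system is solved, each block can be fitted
without iterative gradient descent on U.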

In each block an estimate of the same final label class $y$
is produced, then this estimated label is concatenated with
the original input to form the expanded input for the upper
block. In contrast with other deep architectures, such as
DBNs, the goal is not to discover the transformed feature
representation. The structure of the hierarchy
of this kind of architecture makes parallel training
straightforward, as the problem is naturally a batch-mode
optimization one. In purely discriminative tasks DSN
performance is better than the conventional DBN.[86]


62.4.9 Tensor Deep Stacking Networks (T-DSN)

This architecture is an extension of the DSN. It improves
the DSN in two important ways: using higher-order
information by means of covariance statistics, and
transforming the non-convex problem of the lower layer to a
convex sub-problem of the upper layer.[89]

Unlike the DSN, the covariance statistics of the data are
employed using a bilinear mapping from two distinct sets
of hidden units in the same layer to predictions via a
third-order tensor.

Scalability and parallelization are two important
factors in learning algorithms which are not considered
seriously in conventional DNNs.[90][91][92] All
the learning process for the DSN (and TDSN as well) is
done on a batch-mode basis so as to make parallelization
possible on a cluster of CPU or GPU nodes.[85][86]
Parallelization gives the opportunity to scale up the
design to larger (deeper) architectures and data sets.

The basic architecture is suitable for diverse tasks such as 
classification and regression. 

62.4.10 Spike-and-Slab RBMs (ssRBMs) 

The need for real-valued inputs, which are employed in
Gaussian RBMs (GRBMs), motivates the search for
new methods. One of these methods is the spike and slab
RBM (ssRBM), which models continuous-valued inputs
with strictly binary latent variables.[93]

Similar to basic RBMs and their variants, the spike and
slab RBM is a bipartite graph. Like the GRBM, the
visible units (input) are real-valued. The difference arises
in the hidden layer, where each hidden unit comes
with a binary spike variable and a real-valued slab
variable. These terms (spike and slab) come from the
statistics literature,[94] and refer to a prior including a mixture
of two components. One is a discrete probability mass at
zero called spike, and the other is a density over a
continuous domain.[95]

There is also an extension of the ssRBM model, which
is called µ-ssRBM. This variant provides extra modeling
capacity to the architecture using additional terms in the
energy function. One of these terms enables the model to
form a conditional distribution of the spike variables by
means of marginalizing out the slab variables given an
observation.


62.4.11 Compound Hierarchical-Deep Models

The class of architectures called compound HD models,
where HD stands for Hierarchical-Deep, is structured as
a composition of non-parametric Bayesian models with
deep networks. The features learned by deep
architectures such as DBNs,[96] DBMs,[81] deep auto encoders,[97]
convolutional variants,[98][99] ssRBMs,[95] deep coding
networks,[100] DBNs with sparse feature learning,[101]
recursive neural networks,[102] conditional DBNs,[103] and
denoising auto encoders,[104] are able to provide better
representation for more rapid and accurate classification
tasks with high-dimensional training data sets. However,
they are not quite powerful in learning novel classes with
few examples, themselves. In these architectures, all units
through the network are involved in the representation of
the input (distributed representations), and they have to be
adjusted together (high degree of freedom). However, if
we limit the degree of freedom, we make it easier for the
model to learn new classes out of few training samples
(fewer parameters to learn). Hierarchical Bayesian (HB)
models provide learning from few examples, for example
[105][106][107][108][109] for computer vision, statistics, and
cognitive science.

Compound HD architectures try to integrate both
characteristics of HB and deep networks. The compound
HDP-DBM architecture uses a hierarchical Dirichlet process
(HDP) as a hierarchical model, incorporated with the DBM
architecture. It is a full generative model, generalized from
abstract concepts flowing through the layers of the model,
which is able to synthesize new examples in novel classes
that look reasonably natural. Note that all the levels
are learned jointly by maximizing a joint log-probability.

Consider a DBM with three hidden layers; the probability
of a visible input $\mathbf{v}$ is:

$p(\mathbf{v}, \psi) = \frac{1}{Z} \sum_h \exp\left( \sum_{ij} W_{ij}^{(1)} v_i h_j^{(1)} + \sum_{jl} W_{jl}^{(2)} h_j^{(1)} h_l^{(2)} + \sum_{lm} W_{lm}^{(3)} h_l^{(2)} h_m^{(3)} \right)$

where $\mathbf{h} = \{\mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{h}^{(3)}\}$
are the set of hidden units, and
$\psi = \{W^{(1)}, W^{(2)}, W^{(3)}\}$ are the model
parameters, representing visible-hidden and hidden-hidden
symmetric interaction terms.

After a DBM model has been learned, we have an
undirected model that defines the joint distribution
$P(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{h}^{(3)})$.
One way to express what has been learned is the conditional
model $P(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)} \mid \mathbf{h}^{(3)})$
and a prior term $P(\mathbf{h}^{(3)})$.

The part $P(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)} \mid \mathbf{h}^{(3)})$
represents a conditional DBM model, which can be viewed
as a two-layer DBM but with bias terms given by the states
of $\mathbf{h}^{(3)}$:

$P(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)} \mid \mathbf{h}^{(3)}) = \frac{1}{Z(\psi, \mathbf{h}^{(3)})} \exp\left( \sum_{ij} W_{ij}^{(1)} v_i h_j^{(1)} + \sum_{jl} W_{jl}^{(2)} h_j^{(1)} h_l^{(2)} + \sum_{lm} W_{lm}^{(3)} h_l^{(2)} h_m^{(3)} \right)$


62.4.12 Deep Coding Networks

There are several advantages to having a model which can
actively update itself to the context in data. One of these
methods arises from the idea to have a model which is
able to adjust its prior knowledge dynamically according
to the context of the data. Deep coding network (DPCN)
is a predictive coding scheme where top-down
information is used to empirically adjust the priors needed for
the bottom-up inference procedure by means of a deep
locally connected generative model. This is based on
extracting sparse features out of time-varying observations
using a linear dynamical model. Then, a pooling strategy
is employed in order to learn invariant feature
representations. Similar to other deep architectures, these blocks
are the building elements of a deeper architecture where
greedy layer-wise unsupervised learning is used. Note
that the layers constitute a kind of Markov chain such that
the states at any layer are only dependent on the
succeeding and preceding layers.

Deep predictive coding network (DPCN)[111] predicts the
representation of the layer by means of a top-down
approach, using the information in the upper layer and also
temporal dependencies from the previous states.

It is also possible to extend the DPCN to form a
convolutional network.[111]


62.4.13 Multilayer Kernel Machine

The Multilayer Kernel Machine (MKM) as introduced in
[112] is a way of learning highly nonlinear functions with
the iterative applications of weakly nonlinear kernels.
They use the kernel principal component analysis (KPCA),
in [113], as the method for the unsupervised greedy
layer-wise pre-training step of the deep learning architecture.

The $(l+1)$-th layer learns the representation of the previous
layer $l$, extracting the $n_l$ principal components (PC) of
the projection layer $l$ output in the feature domain induced
by the kernel. For the sake of dimensionality reduction
of the updated representation in each layer, a supervised
strategy is proposed to select the best informative features
among the ones extracted by KPCA. The process is:

• ranking the $n_l$ features according to their mutual
information with the class labels;

• for different values of $K$ and $m_l \in \{1, \ldots, n_l\}$,
compute the classification error rate of a K-nearest
neighbor (K-NN) classifier using only the $m_l$ most
informative features on a validation set;

• the value of $m_l$ with which the classifier has reached
the lowest error rate determines the number of
features to retain.

There are some drawbacks in using the KPCA method as
the building cells of an MKM.

Another, more straightforward method of integrating a
kernel machine into the deep learning architecture was
developed by Microsoft researchers for spoken language
understanding applications.[114] The main idea is to use a
kernel machine to approximate a shallow neural net with
an infinite number of hidden units, and then to use the
stacking technique to splice the output of the kernel
machine and the raw input in building the next, higher level
of the kernel machine. The number of levels in this
kernel version of the deep convex network is a
hyperparameter of the overall system determined by cross
validation.

62.4.14 Deep Q-Networks 

This is the latest class of deep learning models targeted
for reinforcement learning, published in February 2015
in Nature.[115] The application discussed in this paper is
limited to ATARI gaming, but the implications for other
potential applications are profound.

62.4.15 Memory networks 

Integrating an external memory component with artificial
neural networks has a long history dating back to early
research in distributed representations[116] and
self-organizing maps. For example, in sparse distributed memory
or HTM the patterns encoded by neural networks are used
as memory addresses for content-addressable memory,
with “neurons” essentially serving as address encoders
and decoders.

In the 1990s and 2000s, there was a lot of related work
with differentiable long-term memories. For example:

• Differentiable push and pop actions for alternative
memory networks called neural stack
machines[117][118]

• Memory networks where the control network’s
external differentiable storage is in the fast weights of
another network[119]

• The LSTM “forget gates”[120]

• Self-referential RNNs with special output units for
addressing and rapidly manipulating each of the
RNN’s own weights in differentiable fashion (so the
external storage is actually internal)[121][122]

More recently, deep learning was shown to be useful in
semantic hashing, [123] where a deep graphical model is
used to model the word-count vectors [124] obtained from a
large set of documents. Documents are mapped to memory
addresses in such a way that semantically similar documents
are located at nearby addresses. Documents similar to a query
document can then be found by simply accessing all the
addresses that differ by only a few bits from the address
of the query document.
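
The lookup step described above can be sketched in a few
lines. This is only an illustration under stated assumptions:
the documents are taken to have already been mapped to short
binary addresses by some trained model, and the 8-bit codes,
the document names, and the helper names (build_index,
addresses_within, neighbours) are all hypothetical rather
than taken from the cited work. The sketch enumerates every
address within a small Hamming radius of the query address
and collects whatever documents are stored there.

```python
from itertools import combinations


def build_index(codes):
    """Map each integer address to the list of document ids stored there."""
    index = {}
    for doc_id, code in codes.items():
        index.setdefault(code, []).append(doc_id)
    return index


def addresses_within(code, n_bits, radius):
    """Yield every n_bits-wide address that differs from `code`
    in at most `radius` bit positions."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = code
            for p in positions:
                flipped ^= 1 << p  # flip one bit of the address
            yield flipped


def neighbours(index, query_code, n_bits, radius):
    """Collect the documents stored at all addresses near the query address."""
    found = []
    for addr in addresses_within(query_code, n_bits, radius):
        found.extend(index.get(addr, []))
    return sorted(found)


# Hypothetical 8-bit codes; a real system would learn these from data.
codes = {
    "doc_a": 0b10110010,
    "doc_b": 0b10110011,  # 1 bit away from doc_a
    "doc_c": 0b10100011,  # 2 bits away from doc_a
    "doc_d": 0b01001100,  # 7 bits away from doc_a
}
index = build_index(codes)
similar = neighbours(index, codes["doc_a"], n_bits=8, radius=2)
```

The cost of a query depends on the number of probed
addresses (here 1 + 8 + 28 = 37 for radius 2 over 8 bits),
not on the number of stored documents, which is what makes
the approach attractive for large collections.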

Neural Turing Machines, [125] developed by Google DeepMind,
extend the capabilities of deep neural networks by
coupling them to external memory resources, which they
can interact with by attentional processes. The combined
system is analogous to a Turing Machine but is
differentiable end-to-end, allowing it to be efficiently trained
with gradient descent. Preliminary results demonstrate
that Neural Turing Machines can infer simple algorithms
such as copying, sorting, and associative recall from input
and output examples.
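
To make "attentional processes" concrete, the following
sketch shows a content-based read from a memory matrix:
each memory row is scored by its cosine similarity to a
query key, the scores are turned into weights with a
softmax, and the read result is the weighted sum of the
rows. It is a simplified illustration, not the published
architecture: location-based addressing, writing, and the
controller network are omitted, and the function names, the
toy memory contents, and the sharpness value are
assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def content_weights(memory, key, sharpness):
    """Softmax over scaled cosine similarities: one weight per memory row."""
    scores = [sharpness * cosine(row, key) for row in memory]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read(memory, weights):
    """Weighted sum of memory rows (a "soft" read of every location at once)."""
    width = len(memory[0])
    return [sum(w * row[j] for w, row in zip(weights, memory)) for j in range(width)]


# Toy memory with three slots; the key is closest to the first slot.
memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
key = [0.9, 0.1, 0.0]
weights = content_weights(memory, key, sharpness=10.0)
read_vector = read(memory, weights)
```

Because every step is a smooth function of the key and the
memory contents, gradients can flow through the read, which
is the property the text refers to as being differentiable
end-to-end.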

Memory networks [126] are another extension to neural
networks incorporating long-term memory, developed by
Facebook research. The long-term memory can
be read and written to, with the goal of using it for
prediction. These models have been applied in the context
of question answering (QA), where the long-term memory
effectively acts as a (dynamic) knowledge base and
the output is a textual response.


62.5 Applications 

62.5.1 Automatic speech recognition 

The results shown in the table below are for automatic
speech recognition on the popular TIMIT data set. This
is a common data set used for initial evaluations of deep
learning architectures. The entire set contains 630
speakers from eight major dialects of American English, with
each speaker reading 10 different sentences. [127] Its small
size allows many different configurations to be tried
effectively. More importantly, the TIMIT task concerns
phone-sequence recognition, which, unlike word-sequence
recognition, permits very weak “language models”, so that
weaknesses in the acoustic modeling aspects of speech
recognition can be more easily analyzed. It was such
analysis on TIMIT, contrasting GMMs (and other generative
models of speech) with DNN models, carried out by Li Deng
and collaborators around 2009-2010, that stimulated early
industrial investment in deep learning technology for
speech recognition from small to large scales, [25][36]
eventually leading to pervasive and dominant use of deep
learning in the speech recognition industry. That analysis
was carried out with comparable performance (less than
1.5% difference in error rate) between discriminative DNNs
and generative models. The error rates presented below,
including these early results and measured as percent phone
error rates (PER), have been summarized over a time span of
the past 20 years:

The success of deep learning was extended from TIMIT
to large vocabulary speech recognition in 2010
by industrial researchers, who adopted large DNN output
layers based on context-dependent HMM states
constructed by decision trees. [130][131] See
comprehensive reviews of this development and of the
state of the art as of October 2014 in the recent Springer
book from Microsoft Research. [37] See also the related
background of automatic speech recognition and the
impact of various machine learning paradigms, notably
including deep learning, in a recent overview article. [132]

One fundamental principle of deep learning is to do away
with hand-crafted feature engineering and to use raw
features. This principle was first explored successfully in the
architecture of deep autoencoder on the “raw”
spectrogram or linear filter-bank features, [133] showing its
superiority over the Mel-Cepstral features, which contain
a few stages of fixed transformation from spectrograms.
The true “raw” features of speech, waveforms, have more
recently been shown to produce excellent larger-scale
speech recognition results. [134]

Since the initial successful debut of DNNs for speech
recognition around 2009-2011, there has been huge
progress. This progress (as well as future directions)
has been summarized into the following eight major
areas: [1][27][37] 1) Scaling up/out and speeding up DNN
training and decoding; 2) Sequence discriminative
training of DNNs; 3) Feature processing by deep models
with solid understanding of the underlying mechanisms;
4) Adaptation of DNNs and of related deep models;
5) Multi-task and transfer learning by DNNs and related
deep models; 6) Convolutional neural networks and
how to design them to best exploit domain knowledge
of speech; 7) Recurrent neural networks and their rich
LSTM variants; 8) Other types of deep models, including
tensor-based models and integrated deep
generative/discriminative models.

Large-scale automatic speech recognition is the first and
the most convincing successful case of deep learning in
recent history, embraced by both industry and academia
across the board. Between 2010 and 2014, the
two major conferences on signal processing and speech
recognition, IEEE-ICASSP and Interspeech, saw
near-exponential growth in the number of accepted
papers on the topic of deep learning for speech
recognition. More importantly, all major commercial
speech recognition systems (e.g., Microsoft Cortana, Xbox,
Skype Translator, Google Now, Apple Siri, Baidu and
iFlyTek voice search, and a range of Nuance speech
products) are nowadays based on deep learning
methods. [1][135][136] See also the recent media interview
with the CTO of Nuance Communications. [137]

The widespread success in speech recognition
achieved by 2011 was followed shortly by large-scale
image recognition, described next.


62.5.2 Image recognition

A common evaluation set for image classification is the
MNIST database. MNIST is composed of handwritten
digits and includes 60,000 training examples and
10,000 test examples. As with TIMIT, its small size
allows multiple configurations to be tested. A
comprehensive list of results on this set can be found
in [138]. The current best result on MNIST is an error rate
of 0.23%, achieved by Ciresan et al. in 2012. [139]


The real impact of deep learning in image or object
recognition, one major branch of computer vision, was
felt in the fall of 2012 after the team of Geoff Hinton and
his students won the large-scale ImageNet competition by
a significant margin over the then-state-of-the-art shallow
machine learning methods. The technology is based on
20-year-old deep convolutional nets, but applied at much
larger scale to a much larger task, since it had been learned
that deep learning works quite well on large-scale speech
recognition. In 2013 and 2014, the error rate on the
ImageNet task using deep learning was further reduced at a
rapid pace, following a similar trend in large-scale speech
recognition.

As in the ambitious moves from automatic speech
recognition toward automatic speech translation and
understanding, image classification has recently been extended
to the more ambitious and challenging task of automatic
image captioning, in which deep learning is the essential
underlying technology. [140][141][142][143]

One example application is a car computer said to be
trained with deep learning, which may be able to let cars
interpret 360° camera views. [144]

62.5.3 Natural language processing 

Neural networks have been used for implementing
language models since the early 2000s. [145] Key
techniques in this field are negative sampling [146] and
word embedding. A word embedding, such as word2vec, can
be thought of as a representational layer in a deep
learning architecture that transforms an atomic word into a
positional representation of the word relative to other words
in the dataset; the position is represented as a point in a
vector space. Using a word embedding as an input layer
to a recursive neural network (RNN) allows the network to
be trained to parse sentences and phrases using an
effective compositional vector grammar. A compositional
vector grammar can be thought of as a probabilistic
context-free grammar (PCFG) implemented by a recursive
neural network. [147] Recursive autoencoders built atop word
embeddings have been trained to assess sentence
similarity and detect paraphrasing. [147] Deep neural
architectures have achieved state-of-the-art results in many
tasks in natural language processing, such as constituency
parsing, [148] sentiment analysis, [149] information
retrieval, [150][151] machine translation, [152][153]
contextual entity linking, [154] and other areas of
NLP. [155]
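
The idea that a word's position is "a point in a vector
space" can be illustrated with a toy example. The three
vectors below are invented for the illustration and are not
taken from word2vec or any trained model; real embeddings
typically have hundreds of dimensions. The sketch measures
how close two words are with cosine similarity and looks up
the nearest neighbour of a word. The function names are
likewise hypothetical.

```python
import math

# Hand-made 3-dimensional "embeddings" (illustrative values only).
embeddings = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, table):
    """Return the other word whose vector points in the most similar direction."""
    return max(
        (other for other in table if other != word),
        key=lambda other: cosine_similarity(table[word], table[other]),
    )


nearest = most_similar("king", embeddings)
```

With these made-up vectors, "queen" lies much closer to
"king" than "apple" does, which is the kind of relative
positioning the paragraph above describes.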

62.5.4 Drug discovery and toxicology 

The pharmaceutical industry faces the problem that a
large percentage of candidate drugs fail to reach the
market. These failures of chemical compounds are caused by
insufficient efficacy on the biomolecular target (on-target
effect), undetected and undesired interactions with other
biomolecules (off-target effects), or unanticipated toxic
effects. [156][157] In 2012 a team led by George Dahl won
the “Merck Molecular Activity Challenge” using multi-task
deep neural networks to predict the biomolecular target
of a compound. [158][159] In 2014 Sepp Hochreiter’s
group used deep learning to detect off-target and toxic
effects of environmental chemicals in nutrients, household
products and drugs, and won the “Tox21 Data Challenge”
of NIH, FDA and NCATS. [160][161] These impressive
successes suggest that deep learning may be superior
to other virtual screening methods. [162][163] Researchers
from Google and Stanford enhanced deep learning for
drug discovery by combining data from a variety of
sources. [164]


62.5.5 Customer relationship management

Success has recently been reported with the application of
deep reinforcement learning in direct marketing settings,
illustrating the suitability of the method for CRM
automation. A neural network was used to approximate the value
of possible direct marketing actions over the customer
state space, defined in terms of RFM (recency, frequency,
monetary value) variables. The estimated value function
was shown to have a natural interpretation as CLV
(customer lifetime value). [165]
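
A rough sketch of the underlying idea follows, with several
loudly stated assumptions. To stay short it replaces the
neural network with a linear function approximator; the two
actions ("no_mail" and "mail"), the scaled RFM values, the
reward, the discount factor, and the learning rate are all
invented for the illustration. It shows only a single
one-step Q-learning update and is not the method of the
cited study.

```python
ACTIONS = ("no_mail", "mail")  # hypothetical marketing actions
GAMMA = 0.9   # discount factor (assumed)
ALPHA = 0.05  # learning rate (assumed)

# One weight vector per action: bias plus three RFM features.
weights = {action: [0.0, 0.0, 0.0, 0.0] for action in ACTIONS}


def features(state):
    """Turn an (recency, frequency, monetary) state into a feature vector."""
    recency, frequency, monetary = state
    return [1.0, recency, frequency, monetary]


def q_value(state, action):
    """Estimated long-run value of taking `action` for a customer in `state`."""
    return sum(w * x for w, x in zip(weights[action], features(state)))


def update(state, action, reward, next_state):
    """One semi-gradient Q-learning step; returns the temporal-difference error."""
    target = reward + GAMMA * max(q_value(next_state, a) for a in ACTIONS)
    error = target - q_value(state, action)
    x = features(state)
    weights[action] = [w + ALPHA * error * xi for w, xi in zip(weights[action], x)]
    return error


# A single made-up transition: mailing a customer produced a purchase.
state = (0.2, 0.5, 0.3)
next_state = (0.0, 0.6, 0.4)
first_error = update(state, "mail", reward=1.0, next_state=next_state)
```

Under this framing, the maximum of q_value over the
available actions is the estimated discounted future return
from a customer in a given state, which is the quantity the
text interprets as customer lifetime value.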

62.6 Deep learning in the human brain

Computational deep learning is closely related to a class
of theories of brain development (specifically, neocortical
development) proposed by cognitive neuroscientists
in the early 1990s. [166] An approachable summary of this
work is Elman et al.'s 1996 book “Rethinking
Innateness” [167] (see also: Shrager and Johnson; [168] Quartz
and Sejnowski [169]). As these developmental theories were
also instantiated in computational models, they are
technical predecessors of purely computationally motivated
deep learning models. These developmental models share
the interesting property that various proposed learning
dynamics in the brain (e.g., a wave of nerve growth factor)
conspire to support the self-organization of just the sort of
inter-related neural networks utilized in the later, purely
computational deep learning models; and such
computational neural networks seem analogous to a view of the
brain’s neocortex as a hierarchy of filters in which each
layer captures some of the information in the operating
environment, and then passes the remainder, as well as
modified base signal, to other layers further up the
hierarchy. This process yields a self-organizing stack of
transducers, well-tuned to their operating environment.
As described in The New York Times in 1995: “...the
infant’s brain seems to organize itself under the
influence of waves of so-called trophic-factors ... different
regions of the brain become connected sequentially, with
one layer of tissue maturing before another and so on
until the whole brain is mature.” [170]

The importance of deep learning with respect to the
evolution and development of human cognition did not
escape the attention of these researchers. One aspect of
human development that distinguishes us from our
nearest primate neighbors may be changes in the timing of
development. [171] Among primates, the human brain
remains relatively plastic until late in the post-natal
period, whereas the brains of our closest relatives are more
completely formed by birth. Thus, humans have greater
access to the complex experiences afforded by being
out in the world during the most formative period of
brain development. This may enable us to “tune in” to
rapidly changing features of the environment that other
animals, more constrained by evolutionary structuring of
their brains, are unable to take account of. To the
extent that these changes are reflected in similar timing
changes in the hypothesized wave of cortical development,
they may also lead to changes in the extraction of
information from the stimulus environment during the early
self-organization of the brain. Of course, along with this
flexibility comes an extended period of immaturity,
during which we are dependent upon our caretakers and our
community for both support and training. The theory
of deep learning therefore sees the coevolution of
culture and cognition as a fundamental condition of human
evolution. [172]


62.7 Commercial activities 

Deep learning is often presented as a step towards
realising strong AI, [173] and thus many organizations have
become interested in its use for particular applications.
In December 2013, Facebook announced
that it had hired Yann LeCun to head its new artificial
intelligence (AI) lab, with operations in California,
London, and New York. The AI lab will be used for
developing deep learning techniques that will help
Facebook do tasks such as automatically tagging uploaded
pictures with the names of the people in them. [174]

In March 2013, Geoffrey Hinton and two of his graduate
students, Alex Krizhevsky and Ilya Sutskever, were hired
by Google. Their work will focus on both improving
existing machine learning products at Google and
helping deal with the growing amount of data Google has.
Google also purchased Hinton’s company, DNNresearch.

In 2014 Google also acquired DeepMind Technologies, a 
British start-up that developed a system capable of learn¬ 
ing how to play Atari video games using only raw pixels 
as data input. 

Also in 2014, Microsoft established The Deep Learning 
Technology Center in its MSR division, amassing deep 
learning experts for application-focused activities. 

Baidu hired Andrew Ng to head its new Silicon
Valley-based research lab focusing on deep learning.


62.8 Criticism and comment 

Given the far-reaching implications of artificial intelli¬ 
gence coupled with the realization that deep learning is 
emerging as one of its most powerful techniques, the sub¬ 
ject is understandably attracting both criticism and com¬ 
ment, and in some cases from outside the field of com¬ 
puter science itself. 

A main criticism of deep learning concerns the lack
of theory surrounding many of the methods. Most of
the learning in deep architectures is just some form of
gradient descent. While gradient descent has been
understood for a while now, the theory surrounding other
algorithms, such as contrastive divergence, is less clear
(e.g., Does it converge? If so, how fast? What is it
approximating?). Deep learning methods are often looked at as
a black box, with most confirmations done empirically
rather than theoretically.

Others point out that deep learning should be looked 
at as a step towards realizing strong AI, not as an all- 
encompassing solution. Despite the power of deep learn¬ 
ing methods, they still lack much of the functionality 
needed for realizing this goal entirely. Research psychol¬ 
ogist Gary Marcus has noted that: 

“Realistically, deep learning is only part of the larger chal¬ 
lenge of building intelligent machines. Such techniques 
lack ways of representing causal relationships (...) have 
no obvious ways of performing logical inferences, and 
they are also still a long way from integrating abstract 
knowledge, such as information about what objects are, 
what they are for, and how they are typically used. The 
most powerful A.I. systems, like Watson (...) use tech¬ 
niques like deep learning as just one element in a very 
complicated ensemble of techniques, ranging from the 
statistical technique of Bayesian inference to deductive 
reasoning.” [175]

To the extent that such a viewpoint implies, without
intending to, that deep learning will ultimately constitute
nothing more than the primitive discriminatory levels of a
comprehensive future machine intelligence, a recent pair
of speculations regarding art and artificial intelligence [176]
offers an alternative and more expansive outlook. The
first such speculation is that it might be possible to train
a machine vision stack to perform the sophisticated task
of discriminating between “old master” and amateur
figure drawings; and the second is that such a sensitivity
might in fact represent the rudiments of a non-trivial
machine empathy. It is suggested, moreover, that such
an eventuality would be in line with both anthropology,
which identifies a concern with aesthetics as a key element
of behavioral modernity, and also with a current school
of thought which suspects that the allied phenomenon
of consciousness, formerly thought of as a purely
high-order phenomenon, may in fact have roots deep within
the structure of the universe itself.

In further reference to the idea that a significant degree of
artistic sensitivity might inhere within relatively low
levels, whether biological or digital, of the cognitive
hierarchy, there has recently been published a series of graphic
representations of the internal states of deep (20-30
layers) neural networks attempting to discern within
essentially random data the images on which they have been
trained, [177] and these show a striking degree of what
can only be described as visual creativity. This work,
moreover, has captured a remarkable level of public
attention, with the original research notice receiving well
in excess of one thousand comments, and The Guardian
coverage [178] achieving the status of most frequently
accessed article on that newspaper’s web site.

Some currently popular and successful deep learning
architectures display certain problematic behaviors [179]
(e.g. confidently classifying random data as belonging to
a familiar category of nonrandom images; [180] and
misclassifying minuscule perturbations of correctly classified
images [181]). The creator of OpenCog, Ben Goertzel,
hypothesized [179] that these behaviors are tied to
limitations in the internal representations learned by these
architectures, and that these same limitations would
inhibit integration of these architectures into heterogeneous
multi-component AGI architectures. It is suggested that
these issues can be worked around by developing deep
learning architectures that internally form states
homologous to image-grammar [182] decompositions of observed
entities and events. [179] Learning a grammar (visual or
linguistic) from training data would be equivalent to
restricting the system to commonsense reasoning which
operates on concepts in terms of production rules of the
grammar, and is a basic goal of both human language
acquisition [183] and A.I. (See also Grammar
induction. [184])

62.9 Deep learning software libraries

• Torch - An open source software library for machine 
learning based on the Lua programming language. 

• Theano - An open source machine learning library 
for Python. 

• H2O.ai - An open source machine learning platform
written in Java with a parallel architecture. 

• Deeplearning4j - An open source deep learning
library written for Java. It provides parallelization with
CPUs and GPUs.

• OpenNN - An open source C++ library which im¬ 
plements deep neural networks and provides paral¬ 
lelization with CPUs. 




• NVIDIA cuDNN - A GPU-accelerated library of 
primitives for deep neural networks. 

• DeepLearnToolbox - A Matlab/Octave toolbox for 
deep learning. 

• convnetjs - A JavaScript library for training deep
learning models. It contains online demos. 

• Gensim - A toolkit for natural language processing 
implemented in the Python programming language. 

• Caffe - A deep learning framework.

• Apache SINGA [185] - A deep learning platform
developed for scalability, usability and extensibility.


62.10 See also 

• Sparse coding 

• Compressed Sensing 

• Connectionism 

• Self-organizing map 

• Applications of artificial intelligence 

• List of artificial intelligence projects 

• Reservoir computing 

• Liquid state machine 

• Echo state network 

Imroy, CALR, Noisy, Rich Farmbrough, Guanabot, NeuronExMachina, Ericamick, Xezbeth, MisterSheik, Crunchy Frog, El C, Spoon!,
Simon South, Bobo 192, Smalljim, Rbj, Maurreen, Nothingmuch, Photonique, Andrewbadr, Haham hanuka, Pearle, Mpeisenbr, Mdd, 
Msh210, Uncle Bill, Pouya, BryanD, PAR, Cbumett, Jheald, Geraldshieldsl 1, Kusma, DV8 2XL, Oleg Alexandrov, FrancisTyers, Velho, 
Woohookitty, Linas, Mindmatrix, RuudKoot, Eatsaq, Eyreland, Seventy Three, Kanenas, Graham87, Josh Parris, Mayumashu, Sudo Monas, 
Arunkumar, HappyCamper, Alejo2083, Chris Pressey, Mathbot, Annacoder, Nabarry, Srleffler, Chobot, Commander Nemet, FrankTobia, 
Siddhant, YurikBot, Wavelength, RobotE, RussBot, Michael Slone, Loom91, Grubber, ML, Yahya Abdal-Aziz, Raven4x4x, Moe Epsilon, 
DanBri, Allchopin, Light current, Mceliece, Arthur Rubin, Lyrl, GrinBot-enwiki, Sardanaphalus, Lordspaz, SmackBot, Imz, Henri de 
Solages, Incnis Mrsi, Reedy, InverseHypercube, Cazort, Gilliam, Metacomet, Octahedron80, Spellchecker, Colonies Chris, Jahiegel, Un- 
nikrishnan.am, LouScheffer, Calbaer, EPM, Djcmackay, Michael Ross, Tyrson, Jon Awbrey, Het, Bidabadi-enwiki, Chungc, SashatoBot, 
Nick Green, Harryboyles, Sina2, Lachico, Almkglor, Bushsf, Sir Nicholas de Mimsy-Porpington, FreezBee, Dicklyon, E-Kartolfel, Wiz¬ 
ard 191, Matthew Verey, Isvish, ScottHolden, CapitalR, Gnome (Bot), Tawkerbot2, Marty39, Daggerstab, CRGreathouse, Thermochap, 
Ale jrb, Thomas Keyes, Met mht, Pulkitgrover, Grgarza, Maria Vargas, Roman Cheplyaka, Hpalaiya, Vanished User jdksfajlasd. Near- 
far, Heidijane, Thijs!bot, N5iln, WikilT, James086, PoulyM, Edchi, D.H, Jvstone, HSRT, JAnDbot, BenjaminGittins, RainbowCrane, 
Jthomp4338, Buettcher, MetsBot, David Eppstein, Pax:Vobiscum, MartinBot, Tamer ih~enwiki, Sigmundg, Jargon777, Policron, Useight, 
VolkovBot, Joeoettinger, JohnBlackbume, Jimmaths, Constant314, Starrymessenger, Kjells, Magmi, AllGloryToTheHypnotoad, Bemba, 
Radagast3, Newbyguesses, SieBot, Ivan Stambuk, Robert Loring, Masgatotkaca, Pcontrop, Algorithms, Anchor Link Bot, Melcombe, 
ClueBot, Fleem, Ammarsakaji, Estirabot, 7&6=thirteen, Oldrubbie, Vegetator, Singularity42, Dziewa, Lambtron, Johnuniq, SoxBot III, 
HumphreyW, Vanished user uih38riiw4hjlsd, Mitch Ames, Addbot, Deepmath, Peerc, Eweinber, Sun Ladder, C9900, L.exsteens, Xlasne, 
Egoistorms, Luckas-bot, Quadrescence, Yobot, TaBOT-zerem, Taxisfolder, Carleas, Twohoos, Cassandra Cathcart, Dbln, Materialscientist, 
Informationtheory, Jingluolaodao, Expooz, Raysonik, Xqbot, Isheden, Information tricks, Dani.gomezdp, PHansen, Masrudin, FrescoBot, 
Nageh, Tiramisoo, Sanpitch, Gnomehacker, Pinethicket, Momergil, SkyMachine, SchreyP, Lotje, Miracle Pen, Vanadiumho, Kastchei, 
Djjr, EmausBot, WikitanvirBot, Jmencisom, Bethnim, Quondum, Henriqueroscoe, Terra Novus, ClueBot NG, MelboumeStar, Barrel- 
Proof, TimeOfDei, Frietjes, Thepigdog, Pzrq, MrJosiahT, Lawsonstu, Helpful Pixie Bot, Leopd, BG19bot, Wikil3, Trevayne08, Citation- 
CleanerBot, Brad7777, Schafer510, BattyBot, Bankmichaell, SFK2, Jochen Burghardt, Limit-theorem, Szzoli, 314Username, Roastliras, 
Eigentensor, Comp.arch, SakeUPenn, Logan.dunbar, Prof. Michael Bank, DanBalance, JellydPuppy, KasparBot and Anonymous: 262 

• Computational science Source: https://en.wikipedia.org/wiki/Computational_science?oldid=668832468 Contributors: SebastianHelm, 

Charles Matthews, Fredrik, Mpiper, Ancheta Wis, Behnam, Hugh Mason, Alsocal, Discospinster, Cjb88, Andrewwall, Mykej, Wayne 
Schroeder, Oleg Alexandrov, Ruud Koot, Vegaswikian, Meawoppl, Raikkonen, JJL, SmackBot, Peter Sloot, Chris the speller, Janm67, 
LuchoX, Beetstra, Rhebus, Hu 12, Haus, CmdrObot, Pgr94, Aajaja, Mentifisto, VoABot II, Jwagnerhki, Mrseacow, SharkD, Antony-22, 
JonMcLoone, Arithma, Sargursrihari, Lordvolton, Mcsmom, Douglas256, Kayvan45622, ClueBot, Lgstam, Eurekaman, Tomtzigt, Qwfp, 
Zoidbergmd, T68492, Nicoguaro, Superzoulou, Addbot, Wordsoup, Ekojekoj, Jarble, Yobot, Caracho, AnomieBOT, Materialscientist,
P99am, Isheden, CES1596, FrescoBot, Kiefer.Wolfowitz, Gryllida, Willihans, Lotje, Persian knight shiraz, Almarklein, Ouji-fin,
Beatnik8983, Super48paul, Vilietha, RockMagnetist, Vincehradil, ClueBot NG, Aiwing, DsliceOO, NorthamericalOOO, Nsda, Anbul21, 
EricEnfermero, Ema—or, Compsim, Ice666, Shtamy, Schimmdog, MariSo87, Jaime hoyos b, Yakitawa and Anonymous: 58 

• Exploratory data analysis Source: https://en.wikipedia.org/wiki/Exploratory_data_analysis?oldid=670352181 Contributors: Michael 
Hardy, PuzzletChung, Schutz, Aetheling, Astaines, Cutler, Filemon, Giftlite, Jason Quinn, Khalid hassani, Bender235, El- 
wikipedista-enwiki, Ejrrjs, BlueNovember, Mdd, Oleg Alexandrov, David Haslam, Btyner, Holek, Larman, Rjwilmsi, Chobot, Dppl, 
YurikBot, Wavelength, Morphh, Closedmouth, Zvika, SmackBot, DCDuring, Jtneill, CommodiCast, Nbarth, Gragus, Sergio.ballestrero, 
Abmac, Ft93110, Smallpond, Talgalili, JAnDbot, Magioladitis, Jim Douglas, Johnbibby, Rlsheehan, Luc.girardin, Ignacio Icke, STBotD, 
Botx, Farcaster, Msrasnw, Melcombe, Delphis wk, Linforest, Bea68~enwiki, DragonBot, Ezsdgxrfhcv, Stealth500, BOTarate, JamesX- 
inzhiLi, Qwfp, Tayste, Addbot, Bruce rennes, Kapilghodawat, NjardarBot, Srylesmor, Visnut, Delaszk, Yobot, AlotToLeam, Pablo07, 
Dpoduval, Citation bot, ArthurBot, GrouchoBot, Omnipaedista, Decisal, Thehelpfulbot, Rdledesma, Boxplot, Kiefer.Wolfowitz, Dmitry 
St, DixonDBot, Mean as custard, Valeropedro, Macseven, RockMagnetist, Mathstat, WikiMSL, BG19bot, Therealjoefaith, TwoTwoHello, 
Naabou, Municca, Mkhambaty, Dave Braunschweig, Story645, Fratilias, Alakzi, KasparBot, Olgreenwood and Anonymous: 61 

• Predictive analytics Source: https://en.wikipedia.org/wiki/Predictive_analytics?oldid=669931704 Contributors: The Anome, Edward, 
Michael Hardy, Kku, Karada, Ronz, Avani-enwiki, Giftlite, SWAdair, Thorwald, Bender235, Rubicon, Calair, Rajah, Mdd, Andrewpmk,
Stephen Turner, Dominic, Oleg Alexandrov, OwenX, HughJorgan, RussBot, Leighblackall, Aeusoesl, Gloumouthl, DeadEyeArrow,
EAderhold, Zzuuzz, Talyian, Allens, Zvika, Yvwv, SmackBot, CommodiCast, Meld, Gilliam, EncMstr, Eudaemonic3, S2rikanth,
Onorem, JonHarder, Krexer, Lpgeffen, Doug Bell, Kuru, IronGargoyle, JHunterJ, Ralf Klinkenberg, CmdrObot, Van helsing, Reques¬ 
tion, Pgr94, Myasuda, LeoHeska, Dancter, Talgalili, Scientio, Marek69, JustAGal, Batra, Mr. Darcy, AntiVandalBot, MER-C, Fbooth, 
VoABot II, Baccyak4H, Bellemichelle, Sweet2, Gregheth, Apdevries, Sudheervaishnav, Ekotkie, Dontdoit, Jfroelich, Trusilver, Dvdpwiki, 
Ramkumar.krishnan, Atama, Bonadea, BernardZ, Cyricx, GuyRo, Deanabb, Dherman652, Arpabr, Selain03, Hherbert, Cuttysc, Vikxl, 
Ralftgehrig, Rpm698, MaynardClark, Drakedirect, Chrisguyot, Melcombe, Maralia, Into The Fray, Kai-Hendrik, Ahyeek, Howie Goodell, 
Sterdeus, SpikeToronto, Stephen Milborrow, Jlamro, Isthisthingworking, SchreiberBike, Bateni, Cookiehead, Qwfp, Angoss, Sunsetsky, 
Vianello, MystBot, Vaheterdu, BizAnalyst, Addbot, MrOlhe, Download, Yobot, SOMart, AnomieBOT, IRP, Nosperantos, Citation bot, 
Jtamad, BlaineKohl, Phy31277, CorporateM, GESICC, FrescoBot, Boxplot, I dream of horses, Triplestop, Dmitry St, SpaceFlight89, Jack- 
verr, 2, Peter.borissow, Ethansdad, Cambridgebluel971, Pamparam, Kmettler, Glenn Maddox, Vrenator, Crysb, DARTH SIDIOUS 2, 
Onel5969, WikitanvirBot, SugarfootlOOl, Jssgator, Chire, MainFrame, Synecticsgroup, ClueBot NG, Thirdharringtonskier, Stefanomaz- 
zalai, Ricekido, Widr, Lukel45, Mikeono, Helpful Pixie Bot, WhartonCAI, JonasJSchreiber, Wbml058, BG19bot, Rafaelgmonteiro, 
Lisasolomonsalford, TLAN38, Lynnlangit, Flaticida, MrBilB, MC Wapiti, HHinman, BattyBot, Raspabill, Jeremy Kolb, TwoTwoHello, 
Andrux, Mkhambaty, Cmdima, HeatherMKCampbell, Tommycarney, Tentinator, Sgolestanian, Gbtodd29, Brishtikonna, Thisaccoun- 
tisbs, Mitchki.nj, Jvn mht, JaconaFrere, Pablodim91, Cbuyle, Monkbot, AmandaJohnson2014, JSHorwitz, Thomas Speidel, HappyVDZ, 
Wikiperson99, Justincahoon, Femiolajiga, Stevenfinlay, Rim 1188, Bildn, Nivedital414, Frankecoker, Annaelison, Olosko, Vedanga Ku¬ 
mar, Gary2015, HelpUsStopSpam, Heinrichvk, Rodionos, Olavlaudy and Anonymous: 215 

• Business intelligence Source: https://en.wikipedia.org/wiki/Business_intelligence?oldid=667392522 Contributors: Manning Bartlett, Ant, 
Chuq, Feandrod, Michael Hardy, Norm, Nixdorf, Kku, SebastianHelm, Ellywa, Ronz, Mkoval, Elvis, Mydogategodshat, Jay, Rednblu, 
Pedantl7, Chuckrussell, Traroth, Robbot, ZimZalaBim, Mirv, Aetheling, Fupo, Wile E. Heresiarch, Mattflaschen, Psb777, Ianhowlett, 
Beardo, AlistairMcMillan, Intergalacticz9, Macrakis, Joelm, Khalid hassani, Alem-enwiki, Edcohns, Golbez, Fucky 6.9, Roc, Alexf, 
Beland, Bharatcit, Heirpixel, Karl-Henner, Gscshoyru, DMG413, Kadambarid, Guppyfinsoup, Keystroke, Discospinster, Rhobite, Mart- 
pol, S.K., RJHall, Saturnight, Just zis Guy, you know?, Etz Haim, Tjic, Reinyday, John Vandenberg, Maurreen, MPerel, Nsaa, Mdd, 
Gwalam, Alansohn, Gary, PaulHanson, Arthena, ABCD, Snowolf, Wtmitchell, Evil Monkey, Sciurime, Brookie, Stephen, Zntrip, Dr 
Gangrene, Woohookitty, Mindmatrix, TigerShark, Camw, Arcann, Jetf3000, GregorB, Fiface, Stefanomione, DePiep, Hans Genten, Dou- 
glasGreen~enwiki, Ademkader, Slant, Alberrosidus, AndriuZ, ViriiK, M7bot, Danielsmith, Chrisvonsimson, Bgwhite, Wavelength, Tex- 
asAndroid, StuffOflnterest, RussBot, AVM, Bhny, DanMS, Manop, Grafen, Welsh, Joel7687, Aaron Brenneman, Muu-karhu, Mikeblas, 
Zwobot, Pamela Haas, FangbkOl, Zzuuzz, Chase me ladies, I'm the Cavalry, Arthur Rubin, Nraden, Guillom, Katieh5584, Tom Mor¬ 
ris, Veinor, Drcwright, SmackBot, Schniider-enwiki, McGeddon, MeiStone, Brick Thrower, CommodiCast, Eskimbot, Ohnoitsjamie, 
Folajimi, Jcarroll, Setti, Chris the speller, Bluebot, Stevage, Swells65, Nick Fevine, TheKMan, Xyzzyplugh, Mitrius, Krich, Warren, 
Yasst8, Ohconfucius, Wikiolap, Eliyak, Kuru, Tomhubbard, Dreamrequest, ElixirTechnology, Beetstra, Ashil04, Frederikton, Blork-mtl, 
Farrymcp, Waggers, MTSbot-enwiki, Peyre, Aspandyar, Apolitano, AdjustablePliers, OnBeyondZebrax, Fancet75, IvanFanin, Azl568, 
Nhgaudreau, Codeculturist, HMishkoff, SkyWalker, Racecarradar, CmdrObot, ShelfSkewed, Nmourfield, Cryptblade, Dancter, XOlani, 
Roberta F., FrancoGG, Thijs!bot, Wemight, Qwyrxian, Czenek~enwiki, PerfectStorm, CharlesHoffman, Batra, QuiteUnusual, Prolog, 
Charlesmnicholls, Kbeneby, Ffstevens, JAnDbot, Samholm, MER-C, Rongou, YK Times, Entgroupzd, Technologyvoices, Supercactus, 
Magioladitis, VoABot II, Rajashekar iitm, Vanished user tyl2kl89jql0, Nposs, Ionium, Peters72, WFU, Halfgoku, Cquels, Iamthenewno2, 
R'n'B, Gary a mason, Trusilver, Svetovid, DanDoughty, Siryendor, Extransit, A40220, Wxhatl, Sinotara, Srknet, Edit06, Mark Bosley, 
Naniwako, F'Aquatique, Islamomt, Wendecover, Priyank bolia, WinterSpw, Phani96, Seankenalty, Jeff G., Philip Trueman, TXiKiBoT, 
Blackstarl38, Perohanych, Technopat, Rich Janis, Fredsmith2, Mcclarke, Bansipatel, Andy Dingley, JukoFF, Wikidan829, Ceranthor, 
Quantpole, Ermite-enwiki, Hazel77, Dwandelt, Moonriddengirl, SEOtools, Julianclark, Android Mouse Bot, Ireas, ObserverToSee, Corp 
Vision, Janner210, Aadeel, Bcarrdba, Ncw0617, Melcombe, Denisarona, Jvlock527, Ukpremier, Martarius, ClueBot, WriterFistener, 
Natasha81, John ellenberger, Supertouch, Ryan Rasmussen, Chrisawiki, Niceguyedc, Rickybarron, FeoFrank, Srkview, Aexis, SethGrimes,
Kit Berg, Jpnofseattle, Mymallandnews, Tompana82, Sierramadrid, DumZiBoT, Man koznar-enwiki, Jmkim dot com, Ejossel,
Writerguy71, Addbot, Butterwell, Mehtasanjay, Wsvlqc, Ronhjones, BlackLips, MrOllie, Glane23, Fauxstar, Lightbot, 111, Pravisurabhi,
Luckas-bot, Yobot, Fraggle81, Travis.a.buckingham, Evansl982, Becky Sayles, CoolpriyankalO, IW.HG, Sualfradique,
Intelligent knowl edgeyoudrive, AnomieBOT, Rubinbot, Jsmithll08, Piano non troppo, Materialscientist, Citation bot, Stationcall, Quebec99,
Jehan21, BlaineKohl, Momotoshi, Wperdue, JVRudnick, Prazan, Euthenicsit, Wilcoxaj, RibotBOT, Urchandru, Mathonius, Fovede- 
mon84, Shadowjams, Opagecrtr, Forceblue, Force88, UfHTff, Mark Renier, Wiki episteme, Glaugh, D'ohBot, Greenboite, Pacific202, 
Pinethicket, Rayrubenstein, Qqppqqpp, Triplestop, Jim380, Serols, Dnedzel, Jandalhandler, Steelsabre, Ordnascrazy, Hyphen DJW, ITPer- 
forms, Ethansdad, Genuineditference, Wondigoma, Iaantheron, Ansumang, Crysb, Dr.apostrophe, Goyalaishwarya, Sulhan, Vasant Dhar, 
Nawod, RjwilmsiBot, Bonanjef, Ananthnats, Rolhns83, ITtalker, Helwr, Fogical Cowboy, Timtempleton, JEF888, Dewritech, Kellylautt, 
K6ka, AsceticRose, Jesaisca, Dnazip, Fae, Jahub, TheWakeUpFactory, RuislickO, Alpha Quadrant (alt), Eken7, Makecat, Erianna, Tjtyrrell, 
Openstrings, F Kensington, Yorkshiresoul, Bevelson, Alexandra31, Beroneous, Hanantaiber, Outbackprincess, ClueBot NG, CaveJohnson, 
AMJBIUser, This lousy T-shirt, Jaej, Qarakesek, Happyinmaine, Robiminer, Mathew 105601, Widr, Pmresource, Helpful Pixie Bot, Tim- 
Mulherin, Bpml23, Kaimul7, BG19bot, Vaulttech, Mr.Gaebrial, Joshua.pierce84, Xjengvyen, Jwcga, Chafe66, Foripiquet, Einsteinlebt, 
Reverend T. R. Malthus, Meclee, Sutanupaul, Y.Kondrykava, Khazar2, Dhavalp453, Jkofron4, Ivytrejo, Cwobeel, Meg2765, Mogism, 
Bpmbooks, XXN, Riyadmks, OnTheNet21, Michael.h.zimmerman, Ergoodell, Zkhall, Mangotron, HowardDresner, Mikevandeneijnden, 
Yanis ahmed, Dkrapohl, Ginsuloft, ReclaGroup, BIcurious3334, DauphineBI, Fakun.patra, Compprof9000, Mgt88drcr, Julep.hawthorne, 
Tastiera, Marc Schpnwandt, TechnoTalk, Wiki-jonne, BrettofMoore, Vanished user 9j34rnfjemnrjnasj4, Generalcontributor, Clumsied, 
Mihaescu Constantin, Deever21, Xpansa, Galaktikasoft, Frankecoker, Gary2015, BrandonMcBride, Brendonritz, ThatKongregateGuy, 
Soheilmamani and Anonymous: 692 

• Analytics Source: https://en.wikipedia.org/wiki/Analytics?oldid=670788887 Contributors: SimonP, Michael Hardy, Kku, Ronz, Julesd, 
Charles Matthews, Dysprosia, Kadambarid, Stephenpace, Visviva, Hanswaarle, Jeff3000, GrundyCamellia, Rjwilmsi, Intgr, Srleffler, 
Rick hghtbum, DeadEyeArrow, MagneticFlux, SmackBot, C.Fred, CommodiCast, Ohnoitsjamie, BenAveling, PitOfBabel, Deli nk, 
Sergio.ballestrero, Wikiolap, Kuru, Ocatecir, 16@r, Beetstra, RichardF, IvanFanin, Hobophobe, Famiot, Zgemignani, NishithSingh, 
Gogo Dodo, Barticus88, Brandoneus, QuiteUnusual, TFinn734, Magioladitis, Prabhul37, Elringo, Kimleonard, KylieTastic, Jevansen, 
VolkovBot, Trevorallred, Jimmaths, Tavix, BarryFist, Bansipatel, Rpanigassi, Kerenb, Falcon8765, FittleBenW, Planbhups, Sanya r,
Melcombe, Maralia, Aharol, Apptrain, Ottawahitech, Cyberjacob, GDibyendu, Deineka, Addbot, Mortense, Freakmighty, MrOllie,
Glorydaze716, Luckas-bot, Yobot, Ptbotgourou, Freikorp, AnomieBOT, HikeBandit, Spugsley, BlaineKohl, Kerberusl3, Omnipaedista,
Emcien, FrescoBot, James Doehring, Ethansdad, Jonkerz, RjwilmsiBot, TjBot, DASHBot, Timtempleton, Kellylautt, Tmguru, Simplyuttam,
Idea Farm, KyleAraujo, Gregory787, Paolo787, ClueBot NG, Networld 1965, Helpful Pixie Bot, WhartonCAI, Wbm 1058, Mkennedy 1981, 
BG19bot, Pine, Jobin RV, Foripiquet, Analytically, MikeFampaBI, Clkim, Melenc, Cryptodd, TheAdamEvans, Vishal.dani, OnTheNet21,
Municca, Dougs Campbell, Gmid associates, Makvar, RupertLiptonl986, I am One of Many, Edwinboothnyc, JuanCarlosBrandt, Lukebradford,
Jenny Rankin, Prussonyc, Rzicari, Ramg iitk, Daph8, Filedelinkerbot, Raj Aradhyula, Aymanmogh, Bildn, Anecdotic, Vidyasnap,
Abcdudtc and Anonymous: 110 

• Data mining Source: https://en.wikipedia.org/wiki/Data_mining?oldid=670989267 Contributors: Dreamyshade, WojPob, Bryan Derksen,
The Anome, Ap, Verloren, Andre Engels, Fcueto, Matusz, Deb, Boleslav Bobcik, Hefaistos, Mswake, N8chz, Michael Hardy, Confusss,
Fred Bauder, Isomorphic, Nixdorf, Dhart, Ixfd64, Lament, Alfio, CesarB, Ahoerstemeier, Haakon, Ronz, Angela, Den fjattrade
ankan-enwiki, Netsnipe, Jfitzg, Tristanb, Hike395, Mydogategodshat, Dcoetzee, Andrevan, Jay, Fuzheado, WhisperToMe, Epic~enwiki, 
Tpbradbury, Furrykef, Traroth, Nickshanks, Joy, Shantavira, Pakcw, Robbot, ZimZalaBim, Altenmann, Henrygb, Ojigiri~enwiki, Sun- 
ray, Aetheling, Apogr-enwiki, Wile E. Heresiarch, Tobias Bergemann, Filemon, Adam78, Alan Liefting, Giftlite, ShaunMacPherson, 
Sepreece, Philwelch, Tom harrison, Jkseppan, Simon Lacoste-Julien, Ianhowlett, Varlaam, LarryGilbert, Kainaw, Siroxo, Adam McMaster,
Just Another Dan, Neilc, Comatose51, Chowbok, Gadfium, Pgan002, Bolol729, SarekOfVulcan, Raand, Antandrus, Onco p53, OverlordQ,
Gscshoyru, Urhixidur, Kadambarid, Mike Rosoft, Monkeyman, Keystroke, Rich Farmbrough, Nowozin, Stephenpace, Vitamin b,
Bender235, Flyskippyl, Mamer, Aaronbrick, Etz Haim, Janna Isabot, Mike Schwartz, John Vandenberg, Maurreen, Ejrrjs, Nsaa, Mdd, 
Alansohn, Gary, Walter Gorlitz, Denoir, Rd232, Jeltz, Jet57, Jamiemac, Malo, Compo, Caesura, Axeman89, Vonaurum, Oleg Alexan¬ 
drov, Jefgodesky, Nuno Tavares, OwenX, Woohookitty, Mindmatrix, Katyare, TigerShark, LOL, David Haslam, Ralf Mikut, GregorB, 
Hynespm, Essjay, MarcoTolo, Joerg Kurt Wegner, Simsong, Lovro, Tslocum, Graham87, Deltabeignet, BD2412, Kbdank71, DePiep, 
CoderGnome, Chenxlee, Sjakkalle, Rjwilmsi, Gmelli, Lavishluau, Michal.burda, Bubba73, Bensin, GeorgeBills, GregAsche, HughJor- 
gan, Twerbrou, FlaBot, Emarsee, AlexAnglin, Ground Zero, Mathbot, Jrtayloriv, Predictor, Bmicomp, Compuneo, Vonkje, Gurubrahma, 
BMF81, Chobot, DVdm, Bgwhite, The Rambling Man, YurikBot, Wavelength, NTBot-enwiki, H005, Phantomsteve, AVM, Hede2000, 
Splash, SpuriousQ, Ansell, RadioFan, Hydrargyrum, Gaius Cornelius, Philopedia, Bovineone, Zeno of Elea, EngineerScotty, Nawlin- 
Wiki, Grafen, ONEder Boy, Mshecket, Aaron Brenneman, Jpbowen, Tonyl, Dlyons493, DryaUnda, Bota47, Tlevine, Ripper234, Gra- 
ciella, Deville, Zzuuzz, Lt-wiki-bot, Fang Aili, Pb30, Modify, GraemeL, Wikiant, JoanneB, LeonardoRobOt, ArielGold, Katieh5584, 
John Broughton, SkerHawx, Capitalist, Palapa, SmackBot, Looper5920, ThreeDee912, TestPilot, Unyoyega, Cutter, KocjoBot-enwiki, 
Bhikubhadwa, Thunderboltz, CommodiCast, Comp8956, Delldot, Eskimbot, Slhumph, Onebravemonkey, Ohnoitsjamie, Skizzik, Some- 
wherepurple, Leo505, MK8, Thumperward, DHN-bot~enwiki, Tdelamater, Antonrojo, Dilferentview, Janvo, Can't sleep, clown will eat 
me, Sergio.ballestrero, Frap, Nixeagle, Serenity-Fr, Thefriedone, JonHarder, Propheci, Joinamold, Bennose, Mackseem-enwiki, Rada- 
gast83, Nibuod, Daqu, DueSouth, Blake-, Krexer, Weregerbil, Vina-iwbot~enwiki, Andrei Stroe, Deepred6502, Spiritia, Lambiam, Wiki- 
olap, Kuru, Bmhkim, Vgy7ujm, Calum MacUisdean, Athemar, Burakordu, Feraudyh, 16@r, Beetstra, Mr Stephen, Jimmy Pitt, Julthep, 
Dicklyon, Waggers, Ctacmo, RichardF, Nabeth, Beefyt, Hu 12, Enggakshat, Vijay.babu.k, Ft93110, Dagoldman, Veyklevar, Ralf Klinken- 
berg, JHP, IvanLanin, Paul Foxworthy, Adrian.walker, Linkspamremover, CRGreathouse, CmdrObot, Filip*, Van helsing, Shorespirit, 
Mattl299, Kushal one, CWY2190, Ipeirotis, Nilfanion, Cydebot, Valodzka, Gogo Dodo, Ar5144-06, Akhil joey, Martin Jensen, Pingku, 
Oli2140, Mikeputnam, Talgalili, Malleus Fatuorum, Thijs!bot, Barticus88, Nirvanalulu, Drowne, Scientio, Kxlai, Headbomb, Ubuntu2, 
AntiVandalBot, Seaphoto, Ajaysathe, Gwyatt-agastle, Onasraou, Spencer, Alphachimpbot, JAnDbot, Wiki0709, Barek, Samholm, MER-
Gombang, Chiswick Chap, Goingstuckey, Policron, Juliancolton, Homo logos, JohnBlackburne, Philip Trueman, TXiKiBoT, Anony¬
SmackBot, NickyMcLean, Quazar777, Prodego, InverseHypercube, Jtneill, DanielPenfield, Evanreyes, Commander Keane bot, Ohnoitsjamie,
Hraefen, Afa86, Markush, Amatulic, Feinstein, Oli Filth, John Reaves, Berland, Wolf87, Cybercobra, Semanticprecision, G716,
Unco, Theblackgecko, Lambiam, Jonas August, Vjeet a, Nijdam, Beetstra, Emurph, Hu 12, Pjrm, LAlawMedMBA, JoeBot, Chris53516,
Leapsword, BattyBot, Attleboro, Viraltux, Tibbyshep, TheJJJunk, Illia Connell, Knappsych, Citruscoconut, GargantuanDan, TejDham,
Sweep, L353al, FelineAvenger, APH, Sam Hocevar, Perey, Discospinster, Rich Farmbrough, Bender235, ZeroOne, Donsimon-enwiki,
MisterSheik, El C, Edward Z. Yang, DimaDorfman, Cje-enwiki, John Vandenberg, LeonardoGregianin, Jung dalglish, Hooperbloob, 
Landroni, Arcenciel, Nurban, Avenue, Cburnett, Jheald, Facopad, Sjara, Oleg Alexandrov, Roylee, Joriki, Mindmatrix, BlaiseFEgan, 
Btyner, Magister Mathematicae, Tlroche, Rjwilmsi, Ravik, Jeffmcneill, Billjefferys, FlaBot, Brendan642, Kri, Chobot, Reetep, Gdrbot, 
Adoniscik, Wavelength, Pacaro, Gaius Cornelius, ENeville, Dysmorodrepanis-enwiki, SnekOl, BenBildstein, Modify, Mastercampbell, 
Nothht, NielsenGW, Mebden, Bo Jacoby, Cmglee, Boggie-enwiki, Harthacnut, SmackBot, Mmernex, Rtc, Meld, Cunya, Gilliam, Doc- 
torW, Nbarth, G716, Jbergquist, Turms, Bejnar, Gh02t, Wyxel, Josephsieh, JeonghunNoh, Thermochap, BoH, Basar, TheRegicider, 
Farzaneh, Lindsay658, Tdunning, Helgus, EdJohnston, Jvstone, Mack2, Lfstevens, Makohn, Stephanhartmannde, Comrade jo, Ph.eyes, 
Cotfee2 theorems, Ling.Nut, Charlesbaldo, DAGwyn, User Al, Tercer, STBot, Tobyr2, LittleHow, Policron, Jeff badge, Bhepbum, Rob- 
calver, James Kidd, VolkovBot, Thedjatclubrock, Maghnus, TXiKiBoT, Andrewaskew, GirasoleDE, SieBot, Doctorfree, Natta.d, Anchor 
Link Bot, Melcombe, Kvihill, Rfinchdavis, Smithpith, GeneCallahan, Krogstadt, Reovalis, Hussainshafqat, Charledl, ERosa, Qwfp, Td- 
slk, XLinkBot, Erreip, Addbot, K-MUS, Metagraph, LaaknorBot, Ozob, Legobot, Yobot, Gongshow, AnomieBOT, Citation bot, Shadak, 
Danielshin, VladimirReshetnikov, KingScot, JonDePlume, Thehelpfulbot, FrescoBot, Olexa Riznyk, WhatWasDone, Haeinous, JFK0502, 
Kiefer.Wolfowitz, 124Nick, Night Jaguar, Scientist2, Trappist the monk, Gnathan87, Philocentric, Jonkerz, Jowa fan, EmausBot, Blume- 
hua, Montgolfiere, Moswento, McPastry, Bagrowjp, SporkBot, Willy.preghasco, Floombottle, Epdeloso, ClueBot NG, Mathstat, Bayes 
Puppy, Jj 1236, Albertttt, Thepigdog, Helpful Pixie Bot, Michael.d.larkin, Jeraphine Gryphon, Whyking the, Intervallic, CitationCleaner¬ 
Bot, DaleSpam, Kaseton, Simonsm21, Danielribeirosilva, ChrisGualtieri, Alialamifard, Yongli Han, 90b56587, MittensR, Mark viking, 
Boomx09, Waynechew87, Hamoudafg, Promise her a definition, Abacenis, Engheta, Avehtari, SolidPhase, LadyLeodia, KasparBot and 
Anonymous: 228 

• Chi-squared distribution Source: https://en.wikipedia.org/wiki/Chi-squared_distribution?oldid=666620239 Contributors: AxelBoldt, 
Bryan Derksen, The Anome, Ap, Michael Hardy, Stephen C. Carlson, Tomi, Mdebets, Ronz, Den fjattrade ankan-enwiki, Willem, Jitse 
Niesen, Hgamboa, Fibonacci, ZeroOOOO, AaronSw, Robbot, Sander 123, Seglea, Henrygb, Robinh, Isopropyl, Weialawaga-enwiki, Giftlite,
Dbenbenn, BenFrantzDale, Herbee, Sietse, MarkSweep, Gauss, Zfr, Fintor, Rich Farmbrough, Dbachmann, Paul August, Bender235,
MisterSheik, 018, TheProject, NickSchweitzer, lav, Jumbuck, B k, Kotasik, Sligocki, PAR, Cburnett, Shoefly, Oleg Alexandrov,
Mindmatrix, Btyner, Rjwilmsi, Pahan-enwiki, Salix alba, FlaBot, Alvin-cs, Pstevens, Philten, Roboto de Ajvol, YurikBot, Wavelength, Schmock,
Tonyl, Zwobot, JspacemenOl-wiki, Reyk, Zvika, KnightRider-enwiki, SmackBot, Eskimbot, BiT, Afa86, Bluebot, TimBentley, Master 
of Puppets, Silly rabbit, Nbarth, AdamSmithee, Iwaterpolo, Eliezg, Robma, A.R., G716, Saippuakauppias, Rigadoun, Loodog, Mgigan- 
teusl, Qiuxing, Funnybunny, Chris53516, Tawkerbot2, Jackzhp, CBM, Rflrob, Dgw, FilipeS, Blaisorblade, Talgalili, Thijs!bot, DanSoper, 
Lovibond, Pabristow, MER-C, Plantsurfer, Mcorazao, J-stan, Leotolstoy, Wasell, VoABot II, Jaekrystyn, User Al, TheRanger, MartinBot, 
STBot, Steve8675309, Neon white, Icseaturtles, It Is Me Here, TomyDuby, Mikael Haggstrom, Quantling, Policron, Nm420, HyDeckar, 
Sam Blacketer, DrMicro, LeilaniLad, Gaaral44, AstroWiki, Notatoad, Johnlvl2, Wesamuels, Tarkashastri, Quietbritishjim, Rlendog, 
Sheppa28, Phe-bot, Jason Goldstick, Tombomp, OKBot, Melcombe, Digisus, Volkan.cevher, Loren.wilton, Animeronin, ClueBot, Jdg- 
ilbey, MATThematical, UKoch, SamuelTheGhost, EtudiantEco, Bluemaster, Qwfp, XLinkBot, Knetlalala, MystBot, Paulginz, Fergikush, 
Tayste, Addbot, Fgnievinski, Fieldday-sunday, MrOllie, Download, LaaknorBot, Renatokeshet, Lightbot, Ettrig, Chaldor, Luckas-bot, 
Yobot, Wjastle, Johnlemartirao, AnomieBOT, Microball, MtBell, Materialscientist, Geekl337~enwiki, EOBarnett, DirlBot, LilHelpa, 
Lixiaoxu, Xqbot, Eliel Jimenez, Etoombs, Control.valve, NocturneNoir, GrouchoBot, RibotBOT, Entropeter, Shadowjams, Griffinofwales, 
Constructive editor, FrescoBot, Tom.Reding, Stpasha, MastiBot, Gperjim, Fergusq, Xnn, RjwilmsiBot, Kastchei, Alph Bot, Wassermann7, 
Markg0803, EmausBot, Yuzisee, Dai bach, Pet3ris, U+003F, Zephyrus Tavvier, Levdtrotsky, ChuispastonBot, Emilpohl, Brycehughes, 
ClueBot NG, BG19bot, Analytics447, Snouffy, Drhowey, Dlituiev, Minsbot, HelicopterLlama, Limit-theorem, Ameer diaa, Idoz he, 
Zjbranson, DonaghHorgan, Catalin.ghervase, BeyondNormality, Monkbot, Alakzi, Bderrett, Uceeylu and Anonymous: 238 

• Chi-squared test Source: https://en.wikipedia.org/wiki/Chi-squared_test?oldid=667776610 Contributors: The Anome, Matusz, Michael 
Hardy, Tomi, Karada, Ronz, Ciphergoth, Jfitzg, Mxn, Silverfish, Crissov, Robbot, Giftlite, Andris, Matt Crypto, MarkSweep, Piotrus, 
Elektron, Rich Farmbrough, Cap'n Refsmmat, Kwamikagami, Smalljim, PAR, Stefan.karpinski, Spangineer, Wtmitchell, Falcorian,
Bluemoose, Aatombomb, Strait, MZMcBride, Yar Kramer, JoseMires-enwiki, Intgr, Pstevens, YurikBot, Wavelength, Darker Dreams,
DYLAN LENNON-enwiki, Avraham, Arthur Rubin, Reyk, SmackBot, Turadg, Nbarth, Scwlong, Whpq, Bowlhover, G716, Lambiam,
Cronholml44, Loodog, Tim bates, Smith609, Beetstra, Chris53516, Usgnus, Dgw, Requestion, WeggeBot, Steel, Kamna8, Talgalili, Thijs!bot,
Adjespers, Itsmejudith, AntiVandalBot, ReviewDude, Seaphoto, Johannes Simon, Ranger2006, Baccyak4H, KenyaSong, Serviscope Mi¬ 
nor, MartinBot, Poeloq, Lbeaumont, Khatterj, JoshuaEyer, STBotD, VolkovBot, Pleasantville, Grotendeels Onschadelijk, Synthebot, Igno- 
scient, SieBot, Matthew Yeager, Quest for Truth, Svick, Melcombe, Digisus, Tuxa, Animeronin, ClueBot, Muhandes, Qwfp, Tdslk, Wik- 
Head, SilvonenBot, Prax54, Sindbad72, Tayste, Addbot, Luzingit, Doronp, MrOllie, Bhdavisl978, Legobot, Luckas-bot, Yobot, Amirobot, 
AnomieBOT, Walter Grassroot, Unara, Jtamad, KuRiZu, GrouchoBot, Joxemai, Thehelpfulbot, Pinethicket, I dream of horses, Madonius, 
Kastchei, EmausBot, Kgwet, Lolcatsdeamonl3, Orange Suede Sofa, Levdtrotsky, ClueBot NG, Hyiltiz, Ion vasilief, Epfuerst, Helpful 
Pixie Bot, Evcifreo, Chafe66, Jf.alcover, Aymankamelwiki, Brirush, RichardMarioFratini, MNikulin, EJM86, BethNaught, Hannasnow, 
Iwilsonp, Pentaquark and Anonymous: 140 

• Goodness of fit Source: https://en.wikipedia.org/wiki/Goodness_of_fit?oldid=664740099 Contributors: Khendon, Michael Hardy, Ronz,
Den fjattrade ankan-enwiki, Benwing, David Edgar, Giftlite, ReallyNiceGuy, Army 1987, NickSchweitzer, Keflavich, Alkarex, Btyner, 
Demian 12358, YurikBot, Wavelength, Amakuha, Jon Olav Vik, Carlosguitar, Slashme, Kslays, Chris the speller, BostonMA, Mwtoews, 
Nutcracker, Dicklyon, Belizefan, Jayen466, Mr Gronk, Talgalili, Thijs!bot, Danger, Ph.eyes, Fjalokin, Glrx, Bonadea, SueHay, Llam- 
abr, Tomaxer, Melcombe, ClueBot, Tomas e, Hoskee, Qwfp, DumZiBoT, Addbot, Fgnievinski, Renatokeshet, AnomieBOT, Gumlicks, 
Joxemai, Fortdj33, Kastchei, John of Reading, Dai bach, JordiGH, Mathstat, MerllwBot, Bhaveshpatil04 and Anonymous: 51 

• Likelihood-ratio test Source: https://en.wikipedia.org/wiki/Likelihood-ratio_test?oldid=668781783 Contributors: The Anome, Fnielsen, 
Torfason, Michael Hardy, Kku, Notheruser, Den fjattrade ankan-enwiki, Jfitzg, Cherkash, Unknown, Seglea, Meduz, Babbage, Henrygb, 
Elysdir, Robinh, Giftlite, Pgan002, MarkSweep, Corti, Bender235, El C, Arcadian, Seans Potato Business, Cburnett, Jheald, Oleg Alexandrov,
Btyner, Graham87, NeoUrfahraner, Pete.Hurd, Thecurran, Adoniscik, YurikBot, Cancan 101, Draeco, Robertvanl, RL0919, Nescio,
Badgettrg, SmackBot, Tom Lougheed, Rajah9, Nbarth, Yimmieg, Moverly, Tim bates, Dchudz, Smith609, AnRtist, Jackzhp, RobDe68, 
AgentPeppermint, Guy Macon, Mack2, Kniwor, JamesBWatson, TomyDuby, Quantling, Cmcnicoll, AlleborgoBot, Arknascar44, Adis- 
malscientist, Jeremiahrounds, Melcombe, Mild Bill Hiccup, Wildland, lForTheMoney, Qwfp, Jmac2222, Prax54, Jht4060, Tayste, Addbot, 
DOI bot, Nilayvaish, Legobot, Zaqrfv, AnomieBOT, Citation bot, Twri, LilHelpa, ArcadianOnUnsecuredLoc, Kristjan Jonasson, Fortdj33, 
Vthesniper, Octonion, Aryanl989, HRoestBot, Ridgeback22, Madbix, Kastchei, Salvio giuliano, EmausBot, Wiki091005!!, Fanyavizuri, 
Frietjes, Masssly, Sboludo, Helpful Pixie Bot, Fayuel015, NaftaliHarris, BG19bot, Chafe66, Limesave, Chuk.plante, Dexbot, NmlbOll 1, 
Penitence, FrB.TG, TedPSS and Anonymous: 88 

• Statistical classification Source: https://en.wikipedia.org/wiki/Statistical_classification?oldid=666781901 Contributors: The Anome, 
Michael Hardy, GTBacchus, Hike395, Robbot, Benwing, Giftlite, Beland, Violetriga, Kierano, Jerome, Anthony Appleyard, Denoir, Oleg 
Alexandrov, Bkkbrad, Qwertyus, Bgwhite, Roboto de Ajvol, YurikBot, Jrbouldin, Dtrebbien, Tiffanicita, Tobi Kellner, SmackBot, Ob- 
jectOl, Meld, Chris the speller, Nervexmachina, Can't sleep, clown will eat me, Memming, Cybercobra, RichardOOl, Bohunk, Beetstra, 
Hu 12, Billgaitas@hotmail.com, Trauber, Juansempere, Thijs!bot, Prolog, Mack2, Peteymills, VoABot II, Robotmanl974, Quocminh9, 
RJASE1, Jamelan, ThomHImself, Gdupont, Junling, Melcombe, WikiBotas, Agorl53, Addbot, Giggly37, Fgnievinski, SpBot, Movado73, 
Yobot, Oleginger, AnomieBOT, Ashershowl, Verbum Veritas, FrescoBot, Gire 3pich2005, DrilBot, Classifier1234, Jonkerz, Fly by Night, 
Microfries, Chire, SigmaO 1, Rmashhadi, ClueBot NG, Girish280, MerllwBot, Helpful Pixie Bot, Chywe, Swsboarder366, Klilidiplomus, 
Ferrarisailor, Mark viking, Francisbach, Imphil, I Less than3 Maths, LdyBruin and Anonymous: 65 

• Binary classification Source: https://en.wikipedia.org/wiki/Binary_classification?oldid=668840507 Contributors: The Anome, Michael 
Hardy, Kku, Nichtich-enwiki, Janka-enwiki, Henrygb, Sepreece, Wmahan, Dfrankow, Nonpareility, 3mta3, Oleg Alexandrov, Linas, 
Btyner, Qwertyus, Salix alba, FlaBot, Jaraalbe, DRosenbach, RG2, SmackBot, Chris the speller, Nbarth, Mauro Bieg, Ebraminio, 
Amit Moscovich, Coolhandscot, Mstillman, STBot, Rlsheehan, Salih, Mikael Haggstrom, HELvet, Dr.007, Synthesis88, Jamelan, The- 
fellswooper, Pkgx, Melcombe, Denisarona, Mild Bill Hiccup, Qwfp, AndrewHZ, Fgnievinski, Yobot, AnomieBOT, Twri, Saeidpourbabak, 
FrescoBot, Duoduoduo, MartinThoma, Pablo Picossa, Alishahss75ali, Sds57, SoledadKabocha, Alialamifard, Richard Kohar, DoctorTer- 
rella, Loraof, Shirleyyoung0812 and Anonymous: 32 

• Maximum likelihood Source: https://en.wikipedia.org/wiki/Maximum_likelihood?oldid=670023975 Contributors: The Anome, 
ChangChienFu, Patrick, Michael Hardy, Lexor, Dcljr, Karada, Ellywa, Den fjattrade ankan-enwiki, Cherkash, Hike395, Samsara, Phil 
Boswell, R3mOt, Guan, Henrygb, Robinh, Giftlite, DavidCary, BenFrantzDale, Chinasaur, Jason Quinn, Urhixidur, Rich Farmbrough, 
Rama, Chowells, Bender235, Maye, Violetriga, 3mta3, Arthena, Inky, PAR, Cburnett, Algocu, Ultramarine, Oleg Alexandrov, James 
I Hall, Rschulz, Btyner, Marudubshinki, Graham87, BD2412, Rjwilmsi, Koavf, Cjpuffin, Mathbot, Nivix, Jrtayloriv, Chobot, Reetep, 
YurikBot, Wavelength, Cancan 101, Dysmorodrepanis-enwiki, Avraham, Saric, Bo Jacoby, XpXiXpY, Zvika, SolarMcPanel, SmackBot,
Royalguardl 1, Warren.cheung, Chris the speller, Nbarth, Hongooi, Juffi, Earlh, TedE, Dreadstar, G716, Qwerpoiu, Loodog, Lim Wei
Quan, Rogerbrent, Hul2, Freeside3, Cbrownl023, Lavaka, Simo Kaupinmaki, 137 0, Travelbird, Headbomb, John254, ZlOx, Nick Num¬ 
ber, Binarybits, Alfalfahotshots, Magioladitis, Albmont, Livingthingdan, Baccyak4H, Julian Brown, A3nm, Cehc84, Algebraic, R'n'B, 
Lilac Soul, Samikrc, Rlsheehan, Gilll 10951, AntiSpamBot, Policron, MJamesCA, Mathuranathan, JeffreyRMiles, Slaunger, TXiKiBoT, 
Ataboy, JimJJewett, Jsdll5, Ramiromagalhaes, Henrikholm, AlleborgoBot, Matt Gleeson, Logan, Zonuleofzinn, Quietbritishjim, Bot- 
Multichill, CurranH, Davidmosen, Svick, Melcombe, Classicalecon, Davyzhu, The Thing That Should Not Be, Drazick, Alexbot, Nu- 
clearWarfare, Qwfp, Agravier, Vitanyi, Ninja247, Addbot, DOI bot, Lucifer87, MrOllie, SpBot, Hawk8103, Westcoastrl3, Luckas-bot, 
Yobot, Amirobot, Mathdrum, Zbodnar, AnomieBOT, DemocraticLuntz, RVS, Citation bot, ArthurBot, Xqbot, Flavio Guitian, Xappppp, 
Shadowjams, Afl523, FrescoBot, LucienBOT, Citation bot 1, EduardoValle, Kiefer.Wolfowitz, Jmc200, Stpasha, BPets, Caspll, Dimtsit, 
Jingyu.cui, RjwilmsiBot, Brandynwhite, Set theorist, Cal-linux, K6ka, Chadhoward, DavidMCEddy, Alexey.kudinkin, JA(000)Davidson, 
SporkBot, Zfeinst, Zueignung, Gjshisha, Mikhail Ryazanov, ClueBot NG, Gareth Griffith-Jones, MelbourneStar, Arthurcburigo, Frietjes, 
Delusion23, Nak9x, Helpful Pixie Bot, Tony Tan, TheMathAddict, Dlituiev, Khazar2, Illia Connell, JYBot, Oneway3124, JamesmcmahonO, 
Deschwartz, Hamoudafg, Crispulop, LokeshRavindranathan, Monkbot, Engheta, Isambard Kingdom and Anonymous: 200 

• Linear classifier Source: https://en.wikipedia.org/wiki/Linear_classifier?oldid=669802403 Contributors: The Anome, Hike395, Whis- 
perToMe, Rls, Benwing, Bernhard Bauer, Wile E. Heresiarch, BenFrantzDale, Neilc, MarkSweep, Bobo 192, Jung dalglish, Arcenciel, 
Oleg Alexandrov, Linas, Bluemoose, Qwertyus, Mathbot, Sderose, Daniel Mietchen, SmackBot, Hongooi, Mcswell, Marcuscalabresus, 
Thijs!bot, Camphor, AnAj, Dougher, BrotherE, TXiKiBoT, Qxz, Phe-bot, Melcombe, SPiNoZA, Jakarr, Addbot, Yobot, AnomieBOT, 
Sgoder and Anonymous: 21 

• Logistic regression Source: https://en.wikipedia.org/wiki/Logistic_regression?oldid=671175842 Contributors: Twanvl, Michael Hardy, 
Tomi, Den fjattrade ankan-enwiki, Benwing, Gak, Giftlite, BrendanH, YapaTi-enwiki, Dfrankow, Pgan002, Bolol729, Qef, Rich Farm- 
brough, Manil, Bender235, Mdf, 018, NickSchweitzer, CarrKnight, Kierano, Arthena, Velella, Oleg Alexandrov, LOL, BlaiseFEgan, 
Qwertyus, Rjwilmsi, Ground Zero, Sderose, Shaggyjacobs, YurikBot, Wavelength, CancanlOl, Johndburger, Rodolfo Hermans, Jtneill, 
D nathl, Lassefolkersen, Aldaron, G716, Esrever, RomanSpa, Nutcracker, Cbuckley, Ionocube, Kenkleinman, Markjosephl25, David s 
graff, Jjoseph, Olberd, Requestion, Future Perfect at Sunrise, Kallerdis, Neoforma, LachlanA, Tomixdf, Mack2, JAnDbot, Every Creek 
Counts, Sanchom, Owenozier, Magioladitis, Baccyak4H, Nszilard, Mbhiii, Lilac Soul, Kpmiyapuram, Djjrjr, Ronny Gunnarsson, Ype- 
trachenko, Dvdpwiki, Squids and Chips, Mobeets, Ktalon, TXiKiBoT, Harrelfe, Antaltamas, Aaport, Jamelan, Synthebot, Dvandeventer, 
Anameofmyveryown, Prakash Nadkami, BAnstedt, Junling, AlanUS, Melcombe, Sphilbrick, Denisarona, Alpapad, Statone, SchreiberBike, 
Cmelan, Aprock, Qwfp, XLinkBot, WikHead, Tayste, Addbot, Kurttg-enwiki, New Image Uploader 929, Luckas-bot, Yobot, Secondsmin- 
utes, AnomieBOT, Ciphers, Materialscientist, Xqbot, Gtfjbl, FrescoBot, X7q, Orubt, Chenopodiaceous, Albertzeyer, Trappist the monk, 
Duoduoduo, Diannaa, RjwilmsiBot, Grumpfel, Strano.m, Mudx77, EmausBot, RMGunton, Alexey.kudinkin, Zephyrus Tavvier, Kchowd- 
hary, SigmaO 1, DASHBotAV, Vldscore, Kjalarr, MhsmithO, Timflutre, Helpful Pixie Bot, Ngocminh.oss, BG19bot, Martha6981, DPL 
bot, BattyBot, Guziran99, Eflatmajor7th, AndrewSmithQueen’s, Jey42, JTravelman, SFK2, Yongli Han, Merespiz, Tentinator, Pkalczynski, 
Yoboho, Hayes.rachel, E8xE8, Tertius51, Nyashinski, ThuyNgocTran, Monkbot, SantiLak, Bluma.Gelley, Kamerondeckerharris, Srp54, 
P.thesling, Alexander Craig Russell, Екатерина Конь, Velvel2, Anuragsodhi, Hughkf, PushTheButtonl08, LockemyCock, Gzluyongxi
and Anonymous: 182 

• Linear discriminant analysis Source: https://en.wikipedia.org/wiki/Linear_discriminant_analysis?oldid=668696480 Contributors: The 
Anome, Fnielsen, XJaM, Edward, Michael Hardy, Kku, Den fjattrade ankan-enwiki, Hike395, Jihg, Deus-enwiki, Nonick, Giftlite, Dun- 
charris, Dfrankow, Pgan002, 3mta3, Arcenciel, Forderud, Crackerbelly, Qwertyus, Rjwilmsi, Mathbot, Predictor, Adoniscik, YurikBot, 
Timholy, Doncram, Tcooke, Shawnc, SmackBot, Maksim-e~enwiki, Slashme, Meld, Memming, Solarapex, Beetstra, Dicklyon, Lifeartist, 
StanfordProgrammer, Petrus Adamus, Sopoforic, Cydebot, Thijs!bot, AlexAlex, AnAj, Mack2, Cpl Syx, Stephenchou0722, Lwaldron, 
R'n'B, G.kunter-enwiki, Nechamayaniger, Daviddoria, SieBot, Ivan Stambuk, Mverleg, Jonomillin, OKBot, Melcombe, Produit, Statone, 
Calimo, Qwfp, Addbot, Mabdul, AndrewHZ, Lightbot, Yobot, Citation bot, Klisanor, Sylwia Ufnalska, Morten Isaksen, Olg wiki, Schnitzel- 
MannGreek, Pcoat, FrescoBot, X7q, Citation bot 1, Wkretzsch, Heavy Joke, Www wwwjsl, Jfmantis, EmausBot, Dewritech, Radshashi, 
Manyu aditya, Marion.cuny, Vldscore, WikiMSL, Helpful Pixie Bot, BG19bot, CitationCleanerBot, Khazar2, Illia Connell, I am One of 
Many, Lcparra, Ashleyleia, Artful Vampire, SJ Defender, Екатерина Конь, Degill, Olosko and Anonymous: 107

• Naive Bayes classifier Source: https://en.wikipedia.org/wiki/Naive_Bayes_classifier?oldid=669331569 Contributors: The Anome,
Awaterl, Olivier, Michael Hardy, Bewildebeast, Zeno Gantner, Karada, Cyp, Den fjattrade ankan-enwiki, Hike395, Njoshi3, WhisperToMe,
Toreau, Phil Boswell, RedWolf, Bkell, Wile E. Heresiarch, Giftlite, Akella, JimD, Bovlb, Macrakis, Neilc, Pgan002, MarkSweep, Gene 
s, Cagri, Anirvan, Trevor Maclnnis, Thorwald, Splatty, Rich Farmbrough, Violetriga, Peterjoel, Smalljim, John Vandenberg, BlueN- 
ovember, Jason Davies, Caesura, Oleg Alexandrov, KKramer-enwiki, Btyner, Mandarax, Qwertyus, Rjwilmsi, Hgkamath, Johnnyw, 
Mathbot, Intgr, Sderose, YurikBot, Wavelength, PiAndWhippedCream, CancanlOl, Bovineone, Arichnad, Karipuf, BOT-Superzerocool, 
Evryman, Johndburger, Mebden, XAVeRY, SmackBot, InverseHypercube, CommodiCast, Stimpy, ToddDeLuca, Gilliam, NickGar- 
vey, Chris the speller, OrangeDog, PerVognsen, Can't sleep, clown will eat me, Memming, Mitar, Neshatian, Jklin, Ringger, WMod- 
NS, Tobym, Shorespirit, Mat 1971, Dstanfor, Arauzo, Dantiston, Sytelus, Vera Rita-enwiki, Dkemper, Prolog, Ninjakannon, Jrennie, 
MSBOT, Colfee2theorems, Tremilux, Saurabh911, Robotmanl974, David Eppstein, User Al, HebrewHammerTime, AllenDowney, 
Troos, AntiSpamBot, Newtman, STBotD, Mike V, RJASE1, VolkovBot, Maghnus, Anna Lincoln, Mbusux, Anders gorm, EverGreg, 
Fcady2007, Jojalozzo, Ddxc, Dchwalisz, AlanUS, Melcombe, Headlessplatter, Kotsiantis, Justin W Smith, Motmahp, Calimo, Diane- 
garey, Doobliebop, Alousybum, Sunsetsky, XLinkBot, Herlocker, Addbot, RPHv, Tsunanet, MrOllie, LaaknorBot, Yobot, TaBOT-zerem, 
Twexcom, AnomieBOT, Rubinbot, Smk65536, The Almighty Bob, Cantons-de-l'Est, FrescoBot, X7q, Prolfviktor, Svourdroculed,
Rickyphyllis, Jonesey95, Geoffrey I Webb, Classifier 1234, Mwojnars, Wingiii, Larry.europe, Helwr, EmausBot, Orphan Wiki,
Tommy2010, GarouDan, Joseagonzalez, ClueBot NG, Hofmic, NilsHaldenwang, Luoli2000, BG19bot, MusikAnimal, Chafe66, Kavish- 
war.wagholikar, Geduowenyang, Hipponix, Fcbarbi, Librawill, ChrisGualtieri, XMU zhangy, Alialamifard, CorvetteC6RVip, Jamesm¬ 
cmahonO, Tonytonov, Jmagasin, ScienceRandomness, Qingyuanxingsi, Micpalmia, Sofia Koutsouveli, Yuchsiao, Mvdyck, Don neufeld, 
YoniSmolin, Rapanshi, Ananth.sankar.1963, Hmerzic and Anonymous: 184 

• Cross-validation (statistics) Source: https://en.wikipedia.org/wiki/Cross-validation_(statistics)?oldid=668481251 Contributors: Michael 
Hardy, Shyamal, Delirium, Den fjattrade ankan-enwiki, Hike395, Phil Boswell, Cutler, Pgan002, Urhixidur, Discospinster, 3mta3, Bfg, 
Jorg Knappen-enwiki, GregorB, Btyner, Qwertyus, Rjwilmsi, Mattopia, BMF81, Wavelength, Rsrikanth05, Bruguiea, Saric, Zvika, Capi¬ 
talist, SmackBot, Glvgfz, Stimpy, Nbarth, Iridescent, CmdrObot, Olaf Davis, Gogo Dodo, Rphirsch, Blaisorblade, Headbomb, Mgierdal, 
Fogeltje, Alanmalek, AnAj, Onasraou, JAnDbot, Olaf, Necroforest, Johnbibby, Paresnah, Jiuguang Wang, VolkovBot, Jamelan, SieBot, 
Anchor Link Bot, Melcombe, Headlessplatter, Calimo, Skbkekas, Rbeg, XLinkBot, MystBot, Addbot, Sohail stat, Fieldday-sunday,
Josevellezcaldas, Movado73, Legobot, Yobot, Materialscientist, Citation bot, Xqbot, Georg Stillfried, WaysToEscape, Allion, X7q,
Codemonkey87, Duoduoduo, Ulatekh, Noblestats, WikitanvirBot, Mjollnir82, ZeroBot, H311Bot, DonnerbO, Robertschulze, ClueBot NG, Widr,
Helpful Pixie Bot, Beaumont877, AdventurousSquirrel, ChrisGualtieri, Khazar2, Clevera, Rolf h nelson, Pandadai, Winlose378,
Cosminstamate, Emmanuel-L.T, Degill, Mberming, Kouroshbehzadian and Anonymous: 79

• Unsupervised learning Source: https://en.wikipedia.org/wiki/Unsupervised_learning?oldid=660135356 Contributors: Michael Hardy,
Kku, Alfio, Ahoerstemeier, Hike395, Ojigiri-enwiki, Gene s, Urhixidur, Alex Kosorukolf, Aaronbrick, Bobo 192, 3mta3, Tablizer, De- 
noir, Nkour, Qwertyus, Rjwilmsi, Chobot, Roboto de Ajvol, YurikBot, Darker Dreams, Daniel Mietchen, SmackBot, CommodiCast, 
Trebor, DHN-bot~enwiki, Lambiam, CRGreathouse, Carstensen, Thijs!bot, Jaxelrod, AnAj, Peteymills, David Eppstein, Agenteseg- 
reto, Maheshbest, Timohonkela, Ng.j, EverGreg, Algorithms, Kotsiantis, Auntof6, PixelBot, Edg2103, Addbot, EjsBot, Yobot, Les boys, 
AnomieBOT, Salvamoreno, D'ohBot, Skyerise, Ranjan.acharyya, BertSeghers, EmausBot, Fly by Night, Rotcaeroib, Stheodor, Daryakav, 
Ida Shaw, Chire, Candace Gillhoolley, WikiMSL, Helpful Pixie Bot, Majidjanz and Anonymous: 40 

• Cluster analysis Source: https://en.wikipedia.org/wiki/Cluster_analysis?oldid=667300175 Contributors: The Anome, Fnielsen, Nealmcb, 
Michael Hardy, Shyamal, Kku, Tomi, GTBacchus, Den fjattrade ankan~enwiki, Cherkash, BAxelrod, Hike395, Dbabbitt, Phil Boswell, 
Robbot, Gandalf61, Babbage, Aetheling, Giftlite, Lcgarcia, Cfp, BenFrantzDale, Soundray~enwiki, Ketil, Khalid hassani, Angelo.romano,
Dfrankow, Gadfium, Pgan002, Gene s, EBB, Sam Hocevar, Pwaring, Jutta, Abdull, Bryan Barnard, Rich Farmbrough, Mathiasl26,
NeuronExMachina, Yersinia~enwiki, Bender235, Alex Kosorukoff, Aaronbrick, John Vandenberg, Greenleaf~enwiki, Ahc, NickSchweitzer,
3mta3, Jonsafari, Jumbuck, Jerome, Terrycojones, Denoir, Jnothman, Stefan.karpinski, Hazard, Oleg Alexandrov, Soultaco, Woohookitty, 
Linas, Uncle G, Borb, Ruud Koot, Tabletop, Male 1979, Joerg Kurt Wegner, DESiegel, Ruziklan, Sideris, BD2412, Qwertyus, Rjwilmsi, 
Koavf, Salix alba, Michal.burda, Denis Diderot, Klonimus, FlaBot, Mathbot, BananaLanguage, Kcamold, Payo, Jrtayloriv, Windharp, 
BMF81, Roboto de Ajvol, The Rambling Man, YurikBot, Wavelength, Argav, SpuriousQ, Pseudomonas, NawlinWiki, Gareth Jones, 
Bayle Shanks, TCrossland, JFD, Hirak 99, Zzuuzz, Rudrasharman, Zigzaglee, Closedmouth, Dontaskme, Kevin, Killerandy, Airconswitch, 
SmackBot, Drakyoko, Jtneill, Pkirlin, ObjectOl, Meld, Ohnoitsjamie, KaragouniS, Bryan Barnard 1, MalafayaBot, Drewnoakes, Tenawy, 
DHN-bot~enwiki, Iwaterpolo, Zacronos, MatthewKarlsen, Krexer, Bohunk, MOO, Lambiam, Friend of facts, Benash, ThomasHofmann, 
Dfass, Beetstra, Ryulong, Nabeth, Hul2, Iridescent, Ralf Klinkenberg, Madla~enwiki, Alanbino, Origin415, Bairam, Ioannes Pragensis,
Joaoluis, Megannnn, Nczempin, Harej bot, Slack—line, Playtime, Endpoint, Dgtized, Skittleys, DumbBOT, Talgalili, Thijs!bot, Barticus88,
Vinoduec, Mailseth, Danhoppe, Phoolimin, Onasraou, Denaxas, AndreasWittenstein, Daytona2, MikeLynch, JAnDbot, Inverse.chi,
.anacondabot, Magioladitis, Andrimirzal, Fallschirmjager, JBIdF, David Eppstein, User Al, Eeera, Varan raptor, LedgendGamer, Jiuguang
Wang, Sommersprosse, Koko90, Smite-Meister, McSly, Dvdpwiki, DavidCBryant, AStrathman, Camm86, TXiKiBoT, RncOOO, Tamas 
Kadar, Mundhenk, Maxim, Winterschlaefer, Lamro, Wheatin, Arrenbas, Sesilbumfluff, Tomfy, Kerveros 99, Seemu, WRK, Drdanl4, 
Harveydrone, Graham853, Wcdriscoll, Zwerglein~enwiki, Osian.h, FghlJklm, Melcombe, Kotsiantis, Freeman77, Victor Chmara, K14m,
Mugvin, Manuel freire, Boing! said Zebedee, Tim32, PixelBot, Lartoven, Chaosdraid, Aprock, Practical321, Qwfp, FORTRANshnger, 
Sunsetsky, Ocean931, Phantom xxiii, XLinkBot, Pichpich, Gnowor, Sujaykoduri, WikHead, Addbot, Allenchue, DOI bot, Bruce rennes, 
Fgnievinski, Gangcai, MrOllie, FerrousTigras, Delaszk, Tide rolls, Lightbot, PAvdK, Fjrohlf, Tobi, Luckas-bot, Yobot, Gulfera, Hungpuiki, 
AnomieBOT, Flamableconcrete, Materialscientist, Citation bot, Xqbot, Erud, Sylwia Ufnalska, Simeon87, Omnipaedista, Kamitsaha, 
Playthebass, FrescoBot, Sacomoto, D'ohBot, Dan Golding, JohnMeier, Slowmo0815, Atlantia, Citation bot 1, Boxplot, Edfox0714,
MondalorBot, Lotje, E.V.Krishnamurthy, Capezl, Koozedine, Tbalius, RjwilmsiBot, Ripchip Bot, Jchemmanoor, GodfriedToussaint,
Aaronzat, Helwr, EmausBot, John of Reading, Stheodor, Elixirrixile, BOUMEDJOUT, ZeroBot, Sgoder, Chire, Darthhappyface, Jucypsycho,
RockMagnetist, Wakebrdkid, Fazlican, Anita5192, ClueBot NG, Marion.cuny, Ericfouh, Simeos, Poirel, Robiminer, Michael-stanton, 
Girish280, Helpful Pixie Bot, Novusuna, BG19bot, Cpkex0102, Wikil3, TimSwast, Cricetus, Douglas H Fisher, Mu.ting, ColanR,
Cornelius3, Illia Connell, Compsim, Mogism, Frosty, Abewley, Mark viking, Metcalm, Ninjarua, Trouveur de faits, TCMemoire, ErezHartuv,
Monkbot, Leegrc, Imsubhashjha, ExaTepHHa KoHb, Olosko, AngelababyOO and Anonymous: 327 

• Expectation-maximization algorithm Source: https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm?oldid=671233060
Contributors: Rodrigob, Michael Hardy, Karada, Jrauser, BAxelrod, Hike395, Phil Boswell, Owenman, Robbyjo~enwiki, Benwing,
Wile E. Heresiarch, Giftlite, Paisa, Vadmium, Onco p53, MarkSweep, Piotras, Cataphract, Rama, MisterSheik, Alex Kosorukoff,
018, John Vandenberg, Jjmerelo~enwiki, 3mta3, Terrycojones, B k, Eric Kvaalen, Cburnett, Finfobia, Jheald, Forderad, Sergey Dmitriev,
Igny, Bkkbrad, Bluemoose, Btyner, Qwertyus, Rjwilmsi, KYPark, Salix alba, Hild, Mathbot, Glopk, Kri, BradBeattie, YurikBot, Nils 
Grimsmo, Schmock, Regis B., Klutzy, Hakeem.gadi, Maechler, Ladypine, M.A.Dabbah, SmackBot, Meld, Nbarth, Tekhnofiend,
Iwaterpolo, Bilgrau, Joeyo, Raptur, Derek fam, Jrouquie, Dicklyon, Alex Selby, Saviourmachine, Lavaka, Requestion, Cydebot, A876, Kallerdis,
LibroO, Blaisorblade, Skittleys, Andyrew609, Talgalili, Tiedyeina, Rusmike, Headbomb, RobHar, LachlanA, AnAj, Zzpmarco,
Dekimasu, JamesBWatson, Richard Bartholomew, Livingthingdan, Nkwatra, User Al, Edratzer, Osquar F, Numbo3, Salih, GongYi,
DouglasLanman, Bigredbrain, Market Efficiency, Lamro, Daviddoria, Pine900, Tambal, Mosahgantil.l, Melcombe, Sitush, Pratx, Alexbot, Hbeigi,
Jakarr, Jwmarck, XLinkBot, Jamshidian, Addbot, Sunjuren, Fgnievinski, LaaknorBot, Aanthonyl243, Peni, Luckas-bot, Yobot,
LeonardoWeiss, AnomieBOT, Citation bot, TechBot, Chuanren, FrescoBot, Nageh, Erhanbas, Nocheenlatierra, Qiemem, Kiefer.Wolfowitz,
Jmc200, Stpasha, Jszymon, GeypycGn, Trappist the monk, Thai Nhi, Ismailari, Dropsciencenotbombs, RjwilmsiBot, Slon02,
EmausBot, Mikealandewar, John of Reading, III, Chire, Statna, ClueBot NG, Rezabot, Meea, Qwerty9967, Helpful Pixie Bot, Rxnt, Bibcode
Bot, BG19bot, Chafe66, Whym, Lvilnis, BattyBot, Yasuo2, Illia Connell, JYBot, Blegat, Yogtad, Tentinator, Marko0991, Ginsuloft,
Wccsnow, Ronniemaor, Monkbot, Nboley, Faror91, DilumA, Rider ranger47, Velvel2, Crimsonslide, Megadata tensor, Surbut, Greatwave and
Anonymous: 151 

• K-means clustering Source: https://en.wikipedia.org/wiki/K-means_clustering?oldid=671161262 Contributors: Fnielsen, Michael Hardy, 
Ixfd64, Den fjattrade ankan~enwiki, Charles Matthews, Dbabbitt, Phil Boswell, Ashwin, Pengo, Giftlite, BenFrantzDale,
Duncharris, Soren.harward, WorldsApart, Ratiocinate, Gazpacho, Rich Farmbrough, Mathiasl26, Greenleaf~enwiki, 3mta3, Jonsafari, Andkaha,
Ricky81682, Jnothman, Alai, Robert K S, Qwertyus, Rjwilmsi, Hgkamath, Miserlou, Gringer, Mathbot, Mahlon, Chobot, Bgwhite,
UkPaolo, YurikBot, Wavelength, SpuriousQ, Annabel, Hakkinen, SamuelRiv, Leishi, Killerandy, SmackBot, Zanetu, Mauls, Meld, Memming,
Cronholml44, Barabum, Denshade, Mauro Bieg, CBM, Chrike, Chrisahn, Talgalili, Thijs!bot, June8th, N5iln, Headbomb, Nick Number, 
Phoolimin, Sanchom, Charibdis, Smartcat, Magioladitis, David Eppstein, Kzafer, Gfxguy, Turketwh, Stimpak, Mati22081979, Alim, 
JohnBlackburne, TXiKiBoT, FedeLebron, ChrisDing, Corvus cornix, Ostrouchov, Yannisl962, Billinghurst, Maxhttle2007, Emiepan,
Illuminated, Strife911, Weston.pace, Ntvuok, AlanUS, Melcombe, PerryTachett, MenoBot, DEEJAY JPM, DragonBot, Alexbot, Pot, 
Tbmurphy, Rcalhoun, Agorl53, Qwfp, Niteskate, Tavlos, Avoided, Addbot, DOI bot, Foma84, Fgnievinski, Homncrase, Wfolta,
AndresH, Yobot, AnomieBOT, Jimll38, Materialscientist, Citation bot, LilHelpa, Honkkis, Gtfjbl, Gilol969, Simeon87, Woolleynick,
Wonderful597, Dpf90, Foobarhoge, FrescoBot, Dan Golding, Phillipe Israel, Jonesey95, Cincoutprabu, Amkilpatrick, NedLevine,
Ranumao, Larry.europe, Helwr, EmausBot, John of Reading, Lessbread, Manyu aditya, ZeroBot, Sgoder, Chire, Toninowiki, OsmOsmO,
Helpsome, ClueBot NG, Mathstat, Jack Greenmaven, Railwaycat, BlueScreenD, Jsanchezalmeida, BG19bot, MusikAnimal,
Mark Arsten, SciCompTeacher, Chmarkine, Utacsecd, Amritamaz, EdwardH, Sundirac, BattyBot, Illia Connell, MarkPundurs, MindAfterMath,
Jamesxl2345, MEmreCelebi, Jcallega, Watarok, E8xE8, Quenhitran, Anrnusna, MSheshera, Monkbot, Mazumdarparijat, Joma.huguet, 
Niraj Aher, Alvisedt, Laiwoonsiu, Eyurtsev, HelpUsStopSpam, Varunjoshi42 and Anonymous: 199

• Hierarchical clustering Source: https://en.wikipedia.org/wiki/Hierarchical_clustering?oldid=670589093 Contributors: Jose Icaza, 
Nealmcb, GTBacchus, Hike395, Dmb000006, 3mta3, Mandarax, Qwertyus, Rjwilmsi, Piet Delport, Hakkinen, DoriSmith,
SmackBot, Mitar, Mwtoews, Krauss, Skittleys, Talgalili, Headbomb, Magioladitis, David Eppstein, CypherzeroO, Salih, FedeLebron,
Krishna.91, Grscjo3, Qwfp, Eric5000, SleightTrickery, MystBot, Addbot, Netzwerkerin, Yobot, Legendre 17, AnomieBOT, GrouchoBot,
FrescoBot, Iamtravis, Citation bot 1, DixonDBot, Ismailari, Saitenschlager, Robtothl, NedLevine, RjwilmsiBot, WikitanvirBot, Jackiey99, 
Jyl9870110, ZeroBot, Chire, Arsl2345, Sgj67, Mathstat, Widr, KLBot2, Kamperh, SciCompTeacher, IluvatarBot, SarahLZ, Astros4477, 
Jmajf, Joeinwiki, PeterLFlomPhD, StuartWilsonMaui, Meatybrainstuff, EKaTepHHa Kohb and Anonymous: 50 

• Instance-based learning Source: https://en.wikipedia.org/wiki/Instance-based_learning?oldid=615580426 Contributors: Ehamberg,
Qwertyus, SmackBot, Hmains, AlanUS, RjwilmsiBot, Garfieldnate, LeviShel, Verhoevenben, Mann.timothy, ChrisGualtieri and Anonymous: 5

• K-nearest neighbors algorithm Source: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm?oldid=665873983 Contributors: 
The Anome, B4hand, Michael Hardy, Ronz, Charles Matthews, Topbanana, AnonMoos, Pakaran, Robbot, Altenmann, DHN, Adam 
McMaster, Pgan002, Dan aka jack, Thorwald, Rama, Slambo, Barro~enwiki, BlueNovember, Caesura, GiovanniS, RHaworth,
SQFreak, Btyner, Marudubshinki, BD2412, Qwertyus, Rjwilmsi, Stoph, Debivort, Wavelength, Janto, Garion96, SmackBot, CommodiCast,
Mdd4696, Stimpy, Meld, DHN-bot~enwiki, Hongooi, MisterHand, Joerite, Memming, Gnack, Hul2, Atreys, Ogerard, Kozuch, AnAj, 
MER-C, Olaf, Jboml, Peteymills, Dustinsmith, User Al, Mach7, McSly, AntiSpamBot, RJASE1, Joeoettinger, TXiKiBoT, ITurtle, Mpx, 
SieBot, Prakash Nadkarni, Flyer22, Narasimhanator, AlanUS, Melcombe, Eamon Nerbonne, Svante 1, Cibi3d, ClueBot, JP.Martin-Flatin, 
Algomaster, Alexbot, Agorl53, El bot de la dieta, Rubybrian, Pradtke, XLinkBot, Ploptimist, Addbot, MrOllie, Protonk, Luckas-bot, 
Yobot, AnomieBOT, Tappoz, Citation bot, Megatang, Miym, PM800, Leonid Volnitsky, FrescoBot, Paine Ellsworth, X7q, Citation bot 
1, Rickyphyllis, Emslo69, Lars Washington, Dinamik-bot, Bracchesimo, Delmonde, Geomwiz, Sideways713, DARTH SIDIOUS 2,
Thedwards, RjwilmsiBot, Larry.europe, GodfriedToussaint, Nikolaosvasiloglou, EmausBot, Logical Cowboy, Fly by Night, Wijobs, Microfries,
Slightsmile, Manyu aditya, Meng6, Chire, Yc319, Mlguy, Vedantkumar, Lovasoa, Dennis97519, Chafe66, Hipponix, Luvegood,
ChrisGualtieri, Vbaculum, Jamesxl2345, Joeinwiki, Sevamoo, TejDham, Comp.arch, LokeshRavindranathan, Skrl5081997, Monkbot, Niraj
Aher, Moshe.benyamin, Crystallizedcarbon, Vermelhomarajo, Sachith500 and Anonymous: 114 

• Principal component analysis Source: https://en.wikipedia.org/wiki/Principal_component_analysis?oldid=670832593 Contributors: 
Ed Poor, Fnielsen, Schewek, Bemfarr, Michael Hardy, Shyamal, Wapcaplet, Ixfd64, Tomi, Jovan, CatherineMunro, Den fjattrade 
ankan~enwiki, Kevin Baas, Cherkash, Hike395, A5, Guaka, Dcoetzee, Ike9898, Jfeckstein, Jessel, Sboehringer, Vincent kraeutler,
Metasquares, Phil Boswell, Npettiaux, Benwing, Centic, SmblOOl, Saforrest, Giftlite, BenFrantzDale, Lupin, Chinasaur, Amp, Yke,
Jason Quinn, Khalid hassani, Dfrankow, Pgan002, Gdm, Fpahl, OverlordQ, Rcs~enwiki, Gene s, Lumidek, Jmeppley, Frau Holle, David -
strauss, Thorwald, Richie, Discospinster, Rich Farmbrough, Pjacobi, Bender235, Gauge, Mdf, Nicolasbock, Lysdexia, Anthony Appleyard, 
Denoir, Jason Davies, Eric Kvaalen, BernardH, Pontus, Jheald, BlastOButter42, Falcorian, Jfr26, RzR~enwiki, Waldir, Kesla, Ketiltrout, 
Rjwilmsi, Andy Kali, FlaBot, Winterstein, Mathbot, Itinerant 1, Tomer Ish Shalom, Chobot, Adoniscik, YurikBot, Wavelength, Pmg, Vecter, 
Freiberg, HenrikMidtiby~enwiki, Bruguiea, Trovatore, Holon, Jpbowen, Crasshopper, Entropeneur, Bota47, SamuelRiv, DaveWF,
JCipriani, H@r@ld, Whaa?, Zvika, Lunch, SmackBot, Slashme, Larry Doolittle, Jtneill, Mdd4696, Wikipedia@natividads.com, Meld, Misfeldt,
Njerseyguy, AhmedHan, Oh Filth, Metacomet, Mihai preda, Tekhnofiend, Huji, Tamfang, KjetillOOl, Dr. Crash, Vina-iwbot~enwiki, 
Ck lostsword, Thejerm, Lambiam, Mgiganteusl, Ben Moore, Dicklyon, Hovden, Nwstephens, Eclairs, Hul2, Luwo, Conormct, Dound, 
Mishrasknehu, Denizstij, CRGreathouse, Shorespirit, MaxEnt, MC10, Hypersphere, Indeterminate, Carstensen, Marklutfel, Seicer,
Talgalili, RichardVeryard, MaTT~enwiki, Javijabot, Dr. Submillimeter, Tillman, GromXXVII, MER-C, JPRBW, .anacondabot, Sirhans,
Meredyth, Brusegadi, Daemun, Destynova, A Hauptfleisch, User Al, Parunach, BlackcatlOO, Zefram, R'n'B, Jorgenumata, Jiuguang 
Wang, McSly, GongYi, Robertgreer, Qtea, Swatiquantie, VasilievW, GcSwRhlc, ChrisDing, Amaher, Slysplace, Jmath666, Peter ja 
shaw, Sjpajantha, Ericmelse, SieBot, ToePeu.bot, Rllrll, Smsarmad, Oxymoron83, Algorithms, AlanUS, Tesil700, Melcombe,
Headlessplatter, DonAByrd, Vectraproject, Anturtle, ClueBot, Ferred, HairyFotr, Mild Bill Hiccup, Robmontagna, Dj.science, SteelSoul,
Calimo, NuclearWarfare, Skbkekas, Gundersen53, Agorl53, SchreiberBike, Aprock, Ondrejspilka, JamesXinzhiLi, Userl02, StevenDH,
Kegon, HarrivBOT, Qwfp, XLinkBot, Dkondras, Kakila, Kbdankbot, Tayste, Addbot, Bruce rennes, MrOllie, Delaszk, Mdnahas,
Lightbot, Legobot, Luckas-bot, Yobot, Crisluengo, AnakngAraw, Chosesdites, Archy33, AnomieBOT, Ciphers, T784303, Citation bot,
Fritsebits, Xqbot, Gtfjbl, Sylwia Ufnalska, Chuanren, Omnipaedista, BulldogBeing, Soon Lee, Joxemai, MuellerJak, Amosdor,
FrescoBot, Rdledesma, X7q, BenzolBot, Gaba p, Pinethicket, Dront, Hechay, Duoduoduo, Jfmantis, PCAexplorer, RjwilmsiBot, Kastchei,
Helwr, Alfaisanomega, Davoodshamsi, GoingBatty, Fran jo, ZeroBot, Josve05a, Drusus 0, Sgoder, Chire, GeorgeBamick, Mayur,
Fjoelskaldr, JordiGH, RockMagnetist, Brycehughes, ClueBot NG, Marion.cuny, Ldvbin, WikiMSL, Helpful Pixie Bot, Roybgardner, Nagarajan
paramasivam, BG19bot, Naomi altman, Chafe66, Ga29sic, JiemingChen, Susie8876, Statisfactions, SarahLZ, Cretchen, Fylbecatulous, 
Imarkovs, Dfbeaton, Cccddd2012, BereNice V, ChrisGualtieri, GoShow, Aimboy, Jogfallsl947, Stevebihings, Duncanpark, Lugia2453, 
The Quirky Kitty, Germanoverlord, SimonPerera, Gabelglesia, Paum89, HalilYurdugul, OhGodltsSoAmazing, Sangdon Lee, Tbouwman, 
Pohne3939, Pandadai, Tmhuey, Pijjin, Hdchina2010, Chenhow2008, Themtide999, Statistix35, Phlegl, Hilary Hou, Mehr86, Monkbot, 
Yrobbers, Bzeitner, JamesMishra, Uprockrhiz, Cyrilauburtin, Potnisanish, Velvel2, Mew95001, CarlJohanl, Olosko, Wanghe07, Embat, 
Ben.dichter, Olgreenwood and Anonymous: 344 

• Dimensionality reduction Source: https://en.wikipedia.org/wiki/Dimensionality_reduction?oldid=620805391 Contributors: Michael 
Hardy, Kku, William M. Connolley, Charles Matthews, Stormie, Psychonaut, Texture, Wile E. Heresiarch, Wolfkeeper, Pgan002,
NeuronExMachina, Euyyn, Runnerl928, Arthena, Zawersh, Oleg Alexandrov, Waldir, Joerg Kurt Wegner, Qwertyus, Ddofer, Tagith,
Bgwhite, YurikBot, Wavelength, Soumya.ray, Welsh, Gareth Jones, Voidxor, SmackBot, Meld, Charivari, Kvng, Laurens-af, CapitalR,
Shelf Skewed, BetacommandBot, Barticus88, Sylenius, Mentifisto, Dougher, Xetrov, SieBot, Kerveros 99, Hegh, Melcombe, Agorl53, 
BOTarate, Lespinats, Addbot, Delaszk, tiLo-’, Movado73, Yobot, Fc renato, Ciphers, FrescoBot, Sa'y, Jonkerz, Helwr, ClueBot NG,
WikiMSL, Helpful Pixie Bot, Craigacp, Cccddd2012, HurriH, OhGodltsSoAmazing, Diman.kham and Anonymous: 46 

• Greedy algorithm Source: https://en.wikipedia.org/wiki/Greedy_algorithm?oldid=667684717 Contributors: AxelBoldt, Hfastedge, 
CatherineMunro, Notheruser, PeterBrooks, Charles Matthews, Dcoetzee, Malcohol, Jaredwf, Meduz, Sverdrup, Wlievens, Enochlau, 
Giftlite, Smjg, Kim Bruning, Jason Quinn, Pgan002, Andycjp, Andreas Kaufmann, TomRitchford, Discospinster, ZeroOne, Nabla,
Diomidis Spinellis, Nandhp, Obradovic Goran, Haham hanuka, CKlunck, Swapspace, Ralphy~enwiki, Ryanmcdaniel, Hammertime,
Mechonbarsa, CloudNine, Mindmatrix, LOL, Cruccone, Ruud Koot, Que, Sangol23, FlaBot, New Thought, Kri, Pavel Kotrc, YurikBot,
Wavelength, Hairy Dude, TheMandarin, Nethgirb, Bota47, Marcosw, Darrel francis, SmackBot, Brianyoumans, Unyoyega, KocjoBot~enwiki,
NickShaforostoff, Trezatium, SynergyBlades, DHN-bot~enwiki, Emurphy42, Omgoleus, MichaelBillington, Mlpkr, Wleizero, Cjohnzen, 
Mcstrother, Suanshsinghal, Cydebot, Jibbist, Thijs!bot, Wikid77, Nkarthiks, Escarbot, Uselesswarrior, Clan-destine, Salgueiro~enwiki,
Chamale, Jddriessen, Albany NY, Magioladitis, Eleschinski2000, Avicennasis, David Eppstein, MangeOl, Zangkannt, Policron, BernardZ,
Maghnus, TXiKiBoT, ArzelaAscoli, Monty845, Hobartimus, Denisarona, HairyFotr, Meekywiki, Enmc, Addbot, Legobot, Fraggle81, 